Table of Contents
- What is Apache Spark and why was it developed?
- What is PySpark?
- How is Spark different from Hadoop MapReduce?
- What are the main advantages and limitations of using PySpark?
- Explain Spark’s architecture in simple terms.
- What is SparkContext?
- What is SparkSession and how is it different from SparkContext?
- What are the main Spark components/modules?
- What is an RDD in Spark?
- How do you create an RDD in PySpark?
- What is the difference between RDD transformations and actions?
- When should I prefer RDDs over DataFrames?
- What is the difference between map(), flatMap(), and filter() in PySpark?
- What is the difference between repartition() and coalesce()?
- What is the difference between cache(), persist(), and unpersist()?
- What is the difference between Broadcast variables and Accumulators?
- What is shuffling in Spark and why is it expensive?
- groupByKey() vs reduceByKey() vs aggregateByKey() vs sortByKey()
- Why are DataFrames faster than RDDs?
- Compare RDD, DataFrame, and Dataset.
- What is a Spark SQL DataFrame?
- What is a schema in Spark and why is it important?
- How does groupBy work in Spark DataFrames?
- How do you pivot and unpivot a DataFrame in Spark?
- What is a DAG and what is lineage in Spark?
- What is data skew in Spark and how do you handle it?
- What are common optimization techniques in Spark?
- What is the Catalyst Optimizer?
- What is serialization/deserialization in Spark?
- What serializers does PySpark support?
- What is mapPartitions() and when would you use it?
- What is a broadcast join in Spark?
- What are Spark deployment modes (Client vs Cluster)?
- What is the spark-submit command used for?
- ORC vs Parquet vs CSV vs JSON – which one to choose?
- How do you read a CSV file with a custom delimiter in PySpark?
- How do you deal with bad or corrupt data while reading files?
- Why do out-of-memory (OOM) issues occur in Spark?
- How do you remove duplicate rows in a Spark DataFrame?
- Is PySpark faster than pandas?
- How do you create a Spark DataFrame from different file formats?
- What is a paired RDD?
- What is the difference between Star Schema and Snowflake Schema?
- How do you track failed jobs or stages in Spark?
- What are some real-world Spark optimization tips you always apply?
Beginner Level PySpark Interview Questions
Q1. What is Apache Spark and why was it developed?
Apache Spark is a fast, general-purpose data processing engine built to handle large-scale data workloads. It was created mainly to overcome Hadoop MapReduce’s disk-heavy, step-by-step processing. Spark keeps data in memory between multiple operations, so tasks like iterative analytics, ML, and interactive queries become much faster.
Q2. What is PySpark?
PySpark is the Python interface to Apache Spark. It lets you write Spark jobs using Python instead of Scala or Java. This is very useful when teams are already using Python for data analysis but need to scale to data that cannot fit on a single machine.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.show()
Q3. How is Spark different from Hadoop MapReduce?
MapReduce writes intermediate results to disk after every stage, which makes it reliable but slow. Spark executes most operations in memory and uses a DAG scheduler to optimize the entire flow. On top of that, Spark supports SQL, streaming, and ML in the same framework, whereas MapReduce is mainly batch-oriented.
Q4. What are the main advantages and limitations of using PySpark?
Advantages: Python-based, scalable, fast due to in-memory processing, rich libraries (SQL, MLlib, streaming), and fault-tolerant.
Limitations: some overhead from Python–JVM communication, and a few advanced APIs arrive in Scala first.
Q5. Explain Spark’s architecture in simple terms.
A Spark application has a Driver (the main controller) and multiple Executors (workers). The driver converts your code into tasks, and the executors run those tasks on partitions of the data in parallel. A cluster manager (like YARN or Kubernetes) provides the resources.
Q6. What is SparkContext?
SparkContext is the entry point to the Spark core engine. It represents the connection to a Spark cluster and lets you create RDDs, broadcast variables, and accumulators. In modern code you usually access it via spark.sparkContext.
Q7. What is SparkSession and how is it different from SparkContext?
SparkSession is the unified entry point to Spark (SQL, DataFrames, reading files, UDFs). It internally creates and manages the SparkContext. So instead of using separate contexts (SQLContext, HiveContext), you just use spark.
Q8. What are the main Spark components/modules?
Spark Core, Spark SQL, Spark Streaming/Structured Streaming, MLlib, and GraphX. In PySpark projects, you’ll most often work with Core, SQL, and sometimes MLlib.
Q9. What is an RDD in Spark?
An RDD (Resilient Distributed Dataset) is an immutable, fault-tolerant, partitioned collection of elements that can be processed in parallel. “Resilient” means Spark can rebuild lost data from its lineage if a node fails.
Q10. How do you create an RDD in PySpark?
You can either parallelize a Python collection:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
Or read from external storage:
text_rdd = spark.sparkContext.textFile("logs/app.log")
Q11. What is the difference between RDD transformations and actions?
Transformations (like map, filter) build a new RDD from an existing one and are lazy. Actions (like collect, count) actually run the job and return a value or write data. Lazy evaluation lets Spark optimize the whole workflow before running it.
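A minimal sketch of the difference, assuming a running SparkSession named spark:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: lazy, nothing runs yet
print(evens.count())                      # action: triggers the job and returns 2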
Q12. When should I prefer RDDs over DataFrames?
When the data is unstructured, or the logic is very custom and doesn’t fit DataFrame operations, or you need control over partition-level operations. For analytics and SQL-like work, DataFrames are preferred.
Q13. What is the difference between map(), flatMap(), and filter() in PySpark?
map() → 1 input → 1 output
flatMap() → 1 input → 0..n outputs (and flattens)
filter() → keeps only elements matching a condition
For example, splitting lines into words is best with flatMap.
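A quick illustration, assuming sc is spark.sparkContext:
lines = sc.parallelize(["hello world", "hi"])
lines.map(lambda l: l.split(" ")).collect()      # [['hello', 'world'], ['hi']]
lines.flatMap(lambda l: l.split(" ")).collect()  # ['hello', 'world', 'hi']
lines.filter(lambda l: "hi" in l).collect()      # ['hi']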
Q14. What is the difference between repartition() and coalesce()?
repartition() can increase or decrease partitions and always does a shuffle.
coalesce() is used mainly to reduce partitions and avoids a full shuffle, so it’s cheaper.
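For example, with a hypothetical DataFrame df:
df_wide = df.repartition(200)            # full shuffle; can increase or decrease partitions
df_narrow = df_wide.coalesce(10)         # merges existing partitions; avoids a full shuffle
print(df_narrow.rdd.getNumPartitions())  # 10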
Q15. What is the difference between cache(), persist(), and unpersist()?
cache() = persist with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
persist() = store with a specific storage level (memory, disk, both).
unpersist() = remove it from cache/disk when you no longer need it, to free resources.
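A small sketch, where df and other_df are placeholder DataFrames that get reused later:
from pyspark import StorageLevel
df.cache()                                      # default storage level
other_df.persist(StorageLevel.MEMORY_AND_DISK)  # explicit level: spill to disk if memory is tight
df.count()                                      # an action materializes the cache
df.unpersist()                                  # free memory/disk when no longer needed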
Intermediate Level PySpark Interview Questions
Q16. What is the difference between Broadcast variables and Accumulators?
Broadcast variables are read-only variables cached on each executor — useful for lookup tables. Accumulators are write-only from executors and read from the driver — useful for counters or sums across the cluster.
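A small self-contained example, assuming sc is spark.sparkContext (the country lookup and error counter are just for illustration):
country_names = sc.broadcast({"IN": "India", "US": "United States"})  # read-only lookup on every executor
bad_codes = sc.accumulator(0)                                         # counter updated by executors

def check(code):
    if code not in country_names.value:
        bad_codes.add(1)
    return code

sc.parallelize(["IN", "US", "XX"]).map(check).count()  # action runs the job
print(bad_codes.value)                                 # read the counter on the driver: 1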
Q17. What is shuffling in Spark and why is it expensive?
Shuffling is the process of redistributing data across partitions, usually during joins or group operations. It’s expensive because it involves disk I/O and network transfer. That’s why we design transformations to minimize shuffles whenever possible.
Q18. groupByKey() vs reduceByKey() vs aggregateByKey() vs sortByKey()
groupByKey() – simple but can transfer a lot of data.
reduceByKey() – combines values locally before shuffling → usually preferred.
aggregateByKey() – allows custom aggregation logic.
sortByKey() – sorts by key; triggers a shuffle.
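A short comparison on a toy paired RDD (assuming sc is spark.sparkContext):
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.reduceByKey(lambda x, y: x + y).collect()        # ('a', 4) and ('b', 2), combined locally first
pairs.aggregateByKey(0, lambda acc, v: acc + v,        # zero value, per-partition function,
                        lambda a, b: a + b).collect()  # and cross-partition combine function
pairs.sortByKey().collect()                            # [('a', 1), ('a', 3), ('b', 2)]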
Q19. Why are DataFrames faster than RDDs?
DataFrames are optimized by the Catalyst Optimizer and executed using the Tungsten engine. Spark can reorder filters, push predicates down, and choose better joins automatically. RDDs do not get these optimizations.
Q20. Compare RDD, DataFrame, and Dataset.
RDD – low-level, more control, less optimization.
DataFrame – high-level, schema-aware, best performance.
Dataset – only in Scala/Java, combines type safety + Catalyst.
In PySpark, DataFrame is the main API.
Q21. What is a Spark SQL DataFrame?
It’s a distributed table-like structure with named columns. You can query it using SQL as well as DataFrame methods, and Spark will optimize the plan for you.
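For example, the same aggregation can be written with SQL or with DataFrame methods (a df with region and revenue columns is assumed here):
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(revenue) AS total FROM sales GROUP BY region").show()
df.groupBy("region").sum("revenue").show()  # equivalent result via the DataFrame API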
Q22. What is a schema in Spark and why is it important?
A schema defines column names and data types. Providing a schema avoids Spark’s type inference on big files, which speeds up reading and leads to more reliable queries.
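A minimal sketch of providing a schema up front (the file path and column names are placeholders):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("region", StringType(), True),
    StructField("revenue", DoubleType(), True),
])
df = spark.read.csv("data/sales.csv", header=True, schema=schema)  # no inference pass over the file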
Q23. How does groupBy work in Spark DataFrames?
You group rows based on one or more columns and then apply aggregations (sum, count, avg, etc.).
df.groupBy("region").sum("revenue").show()
Q24. How do you pivot and unpivot a DataFrame in Spark?
Pivot turns row values into columns:
pivoted = df.groupBy("Product").pivot("Month").sum("Sales")
Unpivoting can be done using SQL’s stack() to turn columns back to rows.
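For example, assuming the pivoted frame ended up with Jan and Feb columns, stack() can turn them back into rows:
unpivoted = pivoted.selectExpr(
    "Product",
    "stack(2, 'Jan', Jan, 'Feb', Feb) as (Month, Sales)"  # 2 = number of (label, value) pairs
)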
Q25. What is a DAG and what is lineage in Spark?
A DAG is the logical execution plan Spark creates from your transformations. Lineage is the record of how an RDD/DataFrame was derived. Because Spark knows lineage, it can recompute lost data if a partition is lost.
Q26. What is data skew in Spark and how do you handle it?
Data skew happens when one key or partition has much more data than others, so one task runs very slowly. You can fix it using salting (add random prefixes to keys), increasing partitions, or broadcasting small tables to avoid skewed joins.
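A rough sketch of salting the skewed side (skewed_df and its columns are assumptions; the other side of the join must be replicated across the same salt range):
from pyspark.sql import functions as F

salted = skewed_df.withColumn(
    "salted_key",
    F.concat(F.col("key"), F.lit("_"), (F.rand() * 10).cast("int"))  # spread one hot key over 10 buckets
)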
Q27. What are common optimization techniques in Spark?
Use DataFrames, filter early, broadcast small tables, avoid groupByKey, cache only when needed, tune spark.sql.shuffle.partitions, and repartition to balance workloads.
Q28. What is the Catalyst Optimizer?
Catalyst is Spark SQL’s query optimizer. It analyzes DataFrame/SQL queries, builds an optimized plan, and chooses the best physical operators (like broadcast hash join vs sort-merge join).
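You can inspect what Catalyst produced for a query with explain(); here df is assumed to have region and revenue columns:
df.filter(df["revenue"] > 1000).select("region").explain(True)  # prints parsed, analyzed, optimized, and physical plans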
Q29. What is serialization/deserialization in Spark?
Serialization is turning objects into bytes so they can be sent over the network or stored. Spark is distributed, so efficient serialization reduces overhead. Deserialization is turning bytes back into objects on the other side.
Q30. What serializers does PySpark support?
On the Python side, PySpark uses Pickle by default. On the JVM side, Spark can use Java serializer or Kryo. Kryo is faster and more compact, so for large-scale jobs, teams often switch to Kryo on the JVM side.
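Switching the JVM serializer to Kryo is a single config setting, for example:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("kryo-example")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())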
Advanced Level PySpark Interview Questions
Q31. What is mapPartitions() and when would you use it?
mapPartitions() runs your function once per partition instead of once per row. This is useful when you need to initialize something expensive (like a DB connection or an ML model) once and reuse it for all rows in that partition.
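A sketch of the pattern — open_connection() and lookup() are hypothetical helpers standing in for any expensive setup, and df is a placeholder DataFrame:
def enrich(rows):
    conn = open_connection()     # hypothetical: created once per partition
    for row in rows:
        yield lookup(conn, row)  # reuse the connection for every row in the partition

enriched = df.rdd.mapPartitions(enrich)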
Q32. What is a broadcast join in Spark?
It’s a join strategy where Spark sends the small table to all executors to avoid shuffling the big table. Ideal when one side is small (a few MB; by default Spark broadcasts automatically when a table is below spark.sql.autoBroadcastJoinThreshold, 10 MB).
from pyspark.sql.functions import broadcast
joined = big_df.join(broadcast(small_df), "id")
Q33. What are Spark deployment modes (Client vs Cluster)?
In client mode, the driver runs where you submit the job (your machine). In cluster mode, the driver runs on one of the cluster nodes. Cluster mode is preferred for production since it doesn’t depend on your laptop being on.
Q34. What is the spark-submit command used for?
spark-submit is the CLI tool used to run a Spark job on a cluster. You can set master, deploy mode, resources, jars, and your PySpark script.
spark-submit --master yarn --deploy-mode cluster --name pyspark-job app.py
Q35. ORC vs Parquet vs CSV vs JSON – which one to choose?
CSV/JSON are human-readable but not optimized. Parquet and ORC are columnar, compressed, and support predicate pushdown, so they are best for analytics. In Lakehouse-style projects, Parquet is a very common default.
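Writing and reading Parquet is built in, for example (the output path is just a placeholder):
df.write.mode("overwrite").parquet("out/sales_parquet")
parquet_df = spark.read.parquet("out/sales_parquet")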
Q36. How do you read a CSV file with a custom delimiter in PySpark?
You can specify options while reading:
df = (spark.read
.option("header", True)
.option("delimiter", ";")
.csv("data/sales.csv"))
Q37. How do you deal with bad or corrupt data while reading files?
Spark supports read modes like PERMISSIVE (default), DROPMALFORMED (drop bad rows), and FAILFAST (stop immediately).
df = (spark.read
.option("mode", "DROPMALFORMED")
.json("logs.json"))
Q38. Why do out-of-memory (OOM) issues occur in Spark?
Mostly due to large shuffles, very big partitions, or caching too much data. You can prevent it by increasing partitions, using MEMORY_AND_DISK persistence, and avoiding unnecessary wide transformations.
Q39. How do you remove duplicate rows in a Spark DataFrame?
Use distinct() for full-row duplicates or dropDuplicates(["col1","col2"]) to de-duplicate based on specific columns.
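For example:
unique_rows = df.distinct()                         # full-row de-duplication
unique_keys = df.dropDuplicates(["col1", "col2"])   # keep one row per (col1, col2)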
Q40. Is PySpark faster than pandas?
On small data, pandas is often faster because it runs locally in memory. PySpark wins when the data is too big for one machine — it scales horizontally. So the answer depends on data size.
Q41. How do you create a Spark DataFrame from different file formats?
Spark has built-in readers:
df_csv = spark.read.csv("file.csv", header=True, inferSchema=True)
df_json = spark.read.json("file.json")
df_txt = spark.read.text("file.txt")
XML needs an extra package like spark-xml.
Q42. What is a paired RDD?
A paired RDD is an RDD of key-value pairs like ("a", 1). Many RDD operations such as joins and aggregations are built around this key-value structure.
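A small example, assuming sc is spark.sparkContext:
sales = sc.parallelize([("a", 100), ("b", 200)])
names = sc.parallelize([("a", "Apples"), ("b", "Bananas")])
sales.join(names).collect()  # [('a', (100, 'Apples')), ('b', (200, 'Bananas'))]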
Q43. What is the difference between Star Schema and Snowflake Schema?
Star Schema is denormalized (fewer joins, faster queries). Snowflake is normalized (more joins, less redundancy). In Spark-based reporting, star schemas are often preferred for performance.
Q44. How do you track failed jobs or stages in Spark?
Use the Spark Web UI (usually on port 4040) to see jobs, stages, tasks, shuffle read/write, and error logs. On clusters (YARN, Kubernetes), you can also inspect application logs from the respective resource manager.
Q45. What are some real-world Spark optimization tips you always apply?
Read with schema, push filters early, broadcast small dimension tables, avoid groupByKey, reduce shuffle partitions before writing, and don’t cache everything — only cache reused DataFrames. These small steps make Spark pipelines stable in production.