In this post, we’ll cover everything you need to know about four important PySpark functions: explode(), explode_outer(), posexplode(), and posexplode_outer(). These functions convert array or map columns into multiple rows, which is essential when working with nested data.
Why do we need these functions?
All four functions share the same core purpose: they take each element inside an array or map column and create a new row for it in the DataFrame. However, they differ in how they handle empty arrays and in what exactly they return.
Let’s explore each function step by step.

explode():
- What it does:
explode() takes an array or map column and generates a new row for each element in it.
- Which rows are kept?
The array or map must contain at least one element; that element can be a valid value, None (null), or an empty string (""). If the array is completely empty (no elements at all), explode() drops that row from the output.
- Let’s see it with the example below.
- Explanation:
When you apply explode() on the array column, it creates a separate row for each element, including the None values and empty strings inside the arrays. The row for "FabricOfData" is left out because its array is empty.
explode_outer():
- How is it different?
explode_outer() works almost the same as explode(), but it does not drop rows with empty or null arrays. Instead, it keeps those rows in the output, showing null where there are no elements.
- When to use?
Use explode_outer() when you want to keep rows that have empty or null arrays instead of losing them.
- Let’s see it with the example below.
posexplode():
- What’s new here?
posexplode() also expands arrays like explode(), but it returns two pieces of information for each element:
  - The position (index) of the element in the array
  - The element value itself
- How to use?
Because it returns two columns (position and value), you cannot use posexplode() inside withColumn(), which expects a single column as output. Instead, use it inside select().
- Important note:
Like explode(), posexplode() skips rows where the array is completely empty.
- Let’s see it with the example below.
posexplode_outer():
- What does it do?
This function combines the features of posexplode() and explode_outer(): it returns both the element’s position and value, and it keeps rows with empty or null arrays.
- When to use?
Use it when you need the position of each element and want to keep rows even if their arrays are empty.
- Let’s see it with the example below.
Summary:

| Function | Keeps Empty Arrays? | Returns Position? | How to Use |
|---|---|---|---|
| explode() | No | No | Use in withColumn() or select() to flatten arrays |
| explode_outer() | Yes | No | Keeps rows with empty arrays |
| posexplode() | No | Yes | Use in select() to get each element and its position |
| posexplode_outer() | Yes | Yes | Use in select() and keep rows with empty arrays |
These functions make working with nested arrays and maps easier by flattening them into rows. Choose the one that fits your needs based on whether you want to keep empty arrays and whether you need element positions.
Complete Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, explode_outer, posexplode, posexplode_outer

spark = SparkSession.builder.appName("ExplodeExamples").getOrCreate()

data = [
    ("James", ["pyspark", "Java", "Scala", None, ""]),
    ("Michael", [None]),
    ("Robert", [""]),
    ("FabricOfData", []),
]
schema = ["Name", "Languages"]
df = spark.createDataFrame(data, schema)

# explode: drops the empty-array row for FabricOfData
df_explode = df.withColumn("Exploded_Language", explode("Languages"))
df_explode.show(truncate=False)

# explode_outer: keeps FabricOfData with a null value
df_explode_outer = df.withColumn("Exploded_Language", explode_outer("Languages"))
df_explode_outer.show(truncate=False)

# posexplode: returns position and value; drops empty arrays
df_posexplode = df.select("*", posexplode("Languages").alias("Position", "Value"))
df_posexplode.show(truncate=False)

# posexplode_outer: returns position and value; keeps empty arrays
df_posexplode_outer = df.select("*", posexplode_outer("Languages").alias("Position", "Value"))
df_posexplode_outer.show(truncate=False)