Handling missing data is a crucial step in ensuring the quality and reliability of your data before any analysis or reporting. In PySpark, the dropna() method offers a flexible way to eliminate rows containing null, None, or NaN values from a DataFrame. In this post, we’ll explore how dropna() works, its parameters, and several practical use cases.
Syntax
fabricofdata_DF.dropna(how="any" | "all", thresh=<int> (optional), subset=["col1", "col2"] (optional))
Parameters
- how: Determines the condition for dropping rows.
  - "any": Drops the row if any column has a null value.
  - "all": Drops the row only if all columns have null values.
- thresh: Sets the minimum number of non-null values a row must have to be retained.
- subset: A list of columns to consider when checking for null values.
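The way the three parameters interact can be modelled in plain Python, without a Spark session. This is only a sketch of the documented rule, not PySpark's own code; `is_missing` and `keep_row` are hypothetical helper names, and rows are represented as dictionaries for illustration.

```python
import math

def is_missing(value):
    """None and float NaN are both treated as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def keep_row(row, how="any", thresh=None, subset=None):
    """Return True if a dict-shaped row would survive the drop rule."""
    cols = list(row) if subset is None else subset
    if thresh is None:
        # how="any" needs every checked column filled; how="all" needs just one
        thresh = len(cols) if how == "any" else 1
    non_missing = sum(not is_missing(row[c]) for c in cols)
    return non_missing >= thresh

row = {"id": 3, "name": None, "age": 30.0, "score": math.nan}
print(keep_row(row))                        # False: name and score are missing
print(keep_row(row, how="all"))             # True: id and age are present
print(keep_row(row, thresh=2))              # True: two non-missing values
print(keep_row(row, subset=["id", "age"]))  # True: both checked columns are present
```

Expressing both `how` options as a minimum count of non-missing values keeps the rule in one comparison, which also makes the role of `thresh` easier to see.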
Let’s understand how each parameter works using an example DataFrame.
dropna() with default parameters:
When used without any arguments, dropna() removes rows where any column contains a null value.
This is a simple and common approach to quickly clean your dataset.
dropna() with how = "any" | "all":
- The how parameter defines whether to drop rows based on any or all null values.
  - Use how = "any" to drop a row as soon as a single column has a null value.
  - Use how = "all" to drop a row only when every column is null or NaN; a row with at least one non-null value is kept.
- Let's check this with an existing DataFrame.
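The difference between the two modes can be checked by hand on a few rows taken from the sample data in the Complete Code section. The snippet below is a plain-Python sketch of the rule, not a Spark call; `missing`, `kept_any`, and `kept_all` are illustrative names.

```python
import math

NAN = math.nan
# (id, name, age, score) for a few rows of the sample data
rows = [
    (1, "Alice", 25.0, 85.0),
    (2, "Bob", None, None),
    (3, None, 30.0, NAN),
    (9, "Ivy", NAN, 91.0),
    (None, None, None, None),
]

def missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))

# how="any": a single missing value is enough to drop the row
kept_any = [r for r in rows if not any(missing(v) for v in r)]
# how="all": the row is dropped only when every value is missing
kept_all = [r for r in rows if not all(missing(v) for v in r)]

print([r[0] for r in kept_any])  # [1]
print([r[0] for r in kept_all])  # [1, 2, 3, 9]
```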
dropna() with how and subset:
To target specific columns while dropping nulls, use the subset parameter. This way, only the defined columns are checked for missing values.
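As a rough illustration of what restricting the check to ["age", "id"] does, here is a plain-Python sketch over four rows of the sample data. It models the rule rather than calling Spark, and `missing`, `kept_any`, and `kept_all` are hypothetical names.

```python
import math

def missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))

rows = [
    {"id": 3, "name": None, "age": 30.0, "score": math.nan},
    {"id": None, "name": "Eve", "age": 22.0, "score": None},
    {"id": 6, "name": "Frank", "age": None, "score": 75.0},
    {"id": None, "name": None, "age": None, "score": None},
]
subset = ["age", "id"]

# how="any" with subset: drop when any of the listed columns is missing
kept_any = [(r["id"], r["name"]) for r in rows
            if not any(missing(r[c]) for c in subset)]
# how="all" with subset: drop only when all of the listed columns are missing
kept_all = [(r["id"], r["name"]) for r in rows
            if not all(missing(r[c]) for c in subset)]

print(kept_any)  # [(3, None)]
print(kept_all)  # [(3, None), (None, 'Eve'), (6, 'Frank')]
```

The row with id 3 survives both checks even though its name and score are missing, because neither column is in the subset.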
dropna() with thresh:
The thresh parameter keeps only rows that have at least the specified number of non-null values; rows with fewer are dropped.
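Counting the non-null values per row makes the effect of thresh=3 easy to predict. The following is a plain-Python sketch on four rows of the sample data, with `non_null_count` as a hypothetical helper; it is not how Spark evaluates the filter internally.

```python
import math

rows = [
    {"id": 1, "name": "Alice", "age": 25.0, "score": 85.0},
    {"id": 2, "name": "Bob", "age": None, "score": None},
    {"id": 6, "name": "Frank", "age": None, "score": 75.0},
    {"id": 9, "name": "Ivy", "age": math.nan, "score": 91.0},
]

def non_null_count(row):
    """Number of values that are neither None nor NaN."""
    return sum(
        not (v is None or (isinstance(v, float) and math.isnan(v)))
        for v in row.values()
    )

counts = {r["name"]: non_null_count(r) for r in rows}
print(counts)  # {'Alice': 4, 'Bob': 2, 'Frank': 3, 'Ivy': 3}

# thresh=3 keeps rows with at least three non-null values
kept = [name for name, n in counts.items() if n >= 3]
print(kept)  # ['Alice', 'Frank', 'Ivy']
```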
dropna() with thresh and subset:
You can also combine thresh with subset to apply the threshold to specific columns only.
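With a subset, only the listed columns count toward the threshold, so the same row can be kept or dropped depending on which columns are checked. A minimal plain-Python sketch, assuming dictionary rows and the hypothetical helpers `non_null` and `keep`:

```python
import math

def non_null(row, cols):
    """Count values in the given columns that are neither None nor NaN."""
    return sum(
        not (row[c] is None or (isinstance(row[c], float) and math.isnan(row[c])))
        for c in cols
    )

def keep(row, thresh, subset=None):
    cols = list(row) if subset is None else subset
    return non_null(row, cols) >= thresh

frank = {"id": 6, "name": "Frank", "age": None, "score": 75.0}
eve = {"id": None, "name": "Eve", "age": 22.0, "score": None}

print(keep(frank, thresh=2))                          # True: 3 of 4 columns filled
print(keep(frank, thresh=2, subset=["age", "name"]))  # False: only name is filled
print(keep(eve, thresh=2, subset=["age", "name"]))    # True: both are filled
print(keep(eve, thresh=3, subset=["age", "name"]))    # False: only 2 columns counted
```

The last call shows a case worth watching for: a threshold larger than the number of subset columns can never be met, so under this rule every row would be dropped.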
dropna() with how and thresh:
Although PySpark allows you to pass both how and thresh, in practice the thresh parameter overrides how. Combining them can therefore lead to unexpected results and is generally discouraged.
The output makes this clear: although how="any" is passed, the rows are filtered by the thresh value alone.
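The precedence can be summarised as a single "effective threshold". The function below is a hypothetical plain-Python sketch of that rule, written to match the behaviour described above rather than copied from PySpark; `effective_thresh` and `n_cols` are illustrative names.

```python
def effective_thresh(how, thresh, n_cols):
    """Minimum number of non-null values a row needs in order to be kept."""
    if thresh is not None:
        return thresh  # an explicit thresh wins; how is ignored
    return n_cols if how == "any" else 1

print(effective_thresh("any", None, 4))  # 4: every column must be filled
print(effective_thresh("all", None, 4))  # 1: one filled column is enough
print(effective_thresh("any", 3, 4))     # 3
print(effective_thresh("all", 3, 4))     # 3: same result, whatever how says
```

Seen this way, passing how alongside thresh adds nothing, which is why it is clearer to pass only one of the two.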
Conclusion
Effectively managing missing values is essential in real-world data processing.
PySpark’s dropna() method is a powerful tool that offers multiple ways to remove incomplete rows, whether by dropping rows with any nulls, applying a threshold of required non-null fields, or focusing on specific columns.
By understanding and using its parameters correctly, you can tailor your data-cleaning strategy to fit your project’s unique requirements.
Complete Code
import math
# Assumes a notebook environment where `spark` and `display` are already defined
# Sample DataFrame
data = [
{"id": 1, "name": "Alice", "age": 25.0, "score": 85.0},
{"id": 2, "name": "Bob", "age": None, "score": None},
{"id": 3, "name": None, "age": 30.0, "score": math.nan},
{"id": 4, "name": "Diana", "age": 40.0, "score": 92.0},
{"id": None, "name": "Eve", "age": 22.0, "score": None},
{"id": 6, "name": "Frank", "age": None, "score": 75.0},
{"id": 7, "name": "Grace", "age": None, "score": None},
{"id": 8, "name": "Hank", "age": 28.0, "score": 88.0},
{"id": 9, "name": "Ivy", "age": math.nan, "score": 91.0},
{"id": 10, "name": None, "age": 35.0, "score": 78.0},
{"id": None, "name": None, "age": None, "score": None} # Fully null row
]
fabricofdata_DF = spark.createDataFrame(data)
display(fabricofdata_DF)
# Default behaviour: how="any", thresh=None, subset=None
default_DF = fabricofdata_DF.dropna()
display(default_DF)  # keeps ids 1, 4, 8
# dropna() with how = "any" | "all"
df2_any = fabricofdata_DF.dropna(how="any")
display(df2_any)  # same as the default: ids 1, 4, 8
df3_all = fabricofdata_DF.dropna(how="all")
display(df3_all)  # drops only the fully null row
# dropna() with how and subset
df_any_sub = fabricofdata_DF.dropna(how="any", subset=["age", "id"])
display(df_any_sub)  # keeps ids 1, 3, 4, 8, 10
df_all_sub = fabricofdata_DF.dropna(how="all", subset=["age", "id"])
display(df_all_sub)  # drops only the fully null row
# dropna() with thresh
df_thresh = fabricofdata_DF.dropna(thresh=3)
display(df_thresh)  # keeps ids 1, 4, 6, 8, 9, 10
# dropna() with thresh and subset
df_thresh_sub = fabricofdata_DF.dropna(thresh=2, subset=["age", "name"])
display(df_thresh_sub)  # keeps Alice, Diana, Eve, Hank
# dropna() with how and thresh (thresh takes precedence over how)
df_h_t = fabricofdata_DF.dropna(how="any", thresh=3)
display(df_h_t)  # same rows as thresh=3 alone