Handling Missing Data in PySpark Using fillna()

SAS
0
Handling Missing Data in PySpark Using fillna()

Dealing with missing data is a fundamental part of data preprocessing to ensure your analysis and reports are reliable. In PySpark, you can remove rows with missing values using the dropna() method, which removes any rows containing NULL, None, or NaN values.


However, dropping all missing data isn’t always ideal because you might lose valuable information from other columns in those rows. That's where PySpark's fillna() method comes in handy — it allows you to fill missing values with default values instead of removing entire rows.


In this post, we'll dive into how to use fillna() in PySpark effectively, including working with different data types and subsets of columns.


Quick Note:

Both DataFrame.na.fill() and DataFrame.fillna() do the same thing — they are interchangeable, so feel free to use whichever you prefer.


Syntax :

dataframe.fillna(value, subset=[*columns])


  • value: The default value to replace nulls with.
  • subset (optional): A list of columns to apply the fill operation on. If not provided, the fill will apply to all columns with null values.

Setting Up an Example DataFrame:

Let's start by creating a DataFrame that includes columns with various data types — integers, strings, booleans, and dates — some of which contain null values:


Using fillna() Without Specifying Columns:

If you don't specify the subset parameter, PySpark will replace null values for all columns where the data type of the column matches the data type of the value you provide

Example: Filling all nulls in integer-like columns with 0

This will replace all nulls in columns of integer, float, or double types with 0


Example: Filling all string nulls with a placeholder

Here, only columns with string data types will have their null values replaced with "missing"

Important: Handling Date Columns

fillna() does not support replacing null values in columns of date type directly. For example, trying to fill nulls in a date column like this

Workaround for Date Columns

A common approach is to convert the date column to a string, apply fillna(), and then convert it back to a date.

This way, the missing dates get replaced with a default date string, then properly converted back.

Using fillna() with a Subset of Columns:

If you only want to fill null values in specific columns, use the subset parameter.

Filling a Single Column

This replaces nulls in the Age column with 0 while leaving others unchanged.

Filling Multiple Columns with Different Values


Summary:

  • Use fillna() when you want to replace missing values without losing entire rows.
  • If you don’t specify columns via subset, PySpark applies the fill operation to all columns matching the value type.
  • For date columns, convert to string, fill missing values, then convert back to dates to avoid errors.

Handling missing data with care improves the robustness and accuracy of your downstream analysis and reporting. The fillna() method in PySpark is a versatile and efficient tool to achieve this.


Complete Code

from pyspark.sql.types  import StructType,StructField,IntegerType,StringType,DateType,FloatType,BooleanType
from pyspark.sql.functions import col

schema = StructType([
    StructField("ID",IntegerType(),True),
    StructField("Age",IntegerType(),True),
    StructField("Height",FloatType(),True),
    StructField("Name",StringType(),True),
    StructField("Is_Student",BooleanType(),True),
    StructField("Enrollment_Date",StringType(),True)
])

data = [
    (1,20,5.9,"Alice",True,"2025-01-01"),
    (2,22,6.1,"Bob",False,"2021-08-09"),
    (3,23,5.8,None,True,"2020-07,24"),
    (4,None,None,"David",None,None),
    (5,25,6.0,"Eve",False,"2025-06-19")
]

df=spark.createDataFrame(data,schema)
df=df.withColumn("Enrollment_Date",col("Enrollment_Date").cast(DateType()))
df.show()


df=df.fillna(0)
df.show()

df=df.fillna("*")
df.show()

df=df.withColumn("Enrollment_Date",col("Enrollment_Date").cast(StringType()))
df = df.fillna("1900-01-01", subset=["Enrollment_Date"])
df=df.withColumn("Enrollment_Date",col("Enrollment_Date").cast(DateType()))
df.show()

df=df.fillna(0,subset=["Age"])
df.show()

df=df.fillna({"Age":0,"Height":0.1,"Name":"*"})
df.show()
Tags:

Post a Comment

0Comments

Post a Comment (0)