Track customers who made a transaction greater than 50000 exactly on the 10th day after account creation. Use PySpark filtering, date difference functions, and DataFrame operations to solve this real-world query puzzle.
Input Data
Expected Output
Input DataFrame Script
from pyspark.sql.types import StructType,StructField,StringType,DateType,IntegerType
from pyspark.sql.functions import col,cast
data = [
("C001", "2024-01-01", "2024-01-11", 60000),
("C001", "2024-01-01", "2024-01-02", 45000),
("C002", "2024-02-01", "2024-02-01", 52000),
("C003", "2024-03-05", "2024-03-15", 51000),
("C004", "2024-01-10", "2024-09-18", 48000)
]
schema = StructType([
StructField("CustomerID", StringType(), True),
StructField("AccountCreationDate", StringType(), True),
StructField("TransactionDate", StringType(), True),
StructField("TransactionAmount", IntegerType(), True)
])
fabricofdata_DF = spark.createDataFrame(data, schema)
fabricofdata_DF = fabricofdata_DF.withColumn("AccountCreationDate", col("AccountCreationDate").cast(DateType())) \
.withColumn("TransactionDate", col("TransactionDate").cast(DateType()))
display(fabricofdata_DF)
Try solving the question yourself! If you need help, click below to reveal the solution.