Given a dataset of customer transactions or movements over time, how can you determine the origin (first location) and destination (last location) for each customer using PySpark?
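Input Data:
+-----------+------------+---------+-----------+
|Customer_ID|TicketNumber|   Origin|Destination|
+-----------+------------+---------+-----------+
|          1|     T-12345|Hyderabad|   Kolkatta|
|          1|     T-12345| Kolkatta|      Patna|
|          1|     T-12345|    Patna|      Delhi|
|          2|     T-56789|  Chennai|        NCR|
|          2|     T-56789|      NCR|       Agra|
+-----------+------------+---------+-----------+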
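Expected Output:
+-----------+---------+-----------+
|Customer_ID|   Origin|Destination|
+-----------+---------+-----------+
|          1|Hyderabad|      Delhi|
|          2|  Chennai|       Agra|
+-----------+---------+-----------+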
Input DataFrame Script
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Reuse the active session (for example in a notebook) or create one.
spark = SparkSession.builder.getOrCreate()

# Each row is one leg of a customer's journey.
schema = StructType([
    StructField("Customer_ID", IntegerType()),
    StructField("TicketNumber", StringType()),
    StructField("Origin", StringType()),
    StructField("Destination", StringType()),
])

data = [
    (1, "T-12345", "Hyderabad", "Kolkatta"),
    (1, "T-12345", "Kolkatta", "Patna"),
    (1, "T-12345", "Patna", "Delhi"),
    (2, "T-56789", "Chennai", "NCR"),
    (2, "T-56789", "NCR", "Agra"),
]

fabricofdata_DF = spark.createDataFrame(data, schema)
fabricofdata_DF.show()
Try solving the question yourself!