Python Pyspark Puzzle 2


Using the provided customer orders data, write a PySpark query that calculates the total number of orders placed by each customer. Additionally, in the next column, list all the distinct item names ordered by that customer, separated by commas (e.g., "Laptop, Mobile, Pendrive").
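To make the expected shape of the result concrete before running any Spark code, here is a tiny plain-Python sketch of the same logic on a few hypothetical rows (the customer IDs and items below are made up for illustration; they are not the generated dataset):

```python
# Hypothetical sample rows: (Customer_ID, Order_ID, ItemName)
rows = [
    ("CUST001", "ORD1000", "Laptop"),
    ("CUST001", "ORD1001", "Mobile"),
    ("CUST001", "ORD1002", "Laptop"),   # repeated item counts as one distinct entry
    ("CUST002", "ORD1003", "Pendrive"),
]

# Accumulate per customer: a set of order IDs and a set of distinct items
result = {}
for customer, order, item in rows:
    orders, items = result.setdefault(customer, (set(), set()))
    orders.add(order)
    items.add(item)

# Total orders plus the distinct items joined by commas
for customer, (orders, items) in sorted(result.items()):
    print(customer, len(orders), ", ".join(sorted(items)))
# CUST001 3 Laptop, Mobile
# CUST002 1 Pendrive
```

Note that repeated items ("Laptop" twice for CUST001) still add to the order count but appear only once in the item list, which is exactly the distinction between counting orders and collecting distinct item names.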


Input Data:


Expected Output:


Input DataFrame Script:
import random
from datetime import datetime, timedelta

# Generate the sample data
customers = [f"CUST{str(i).zfill(3)}" for i in range(1, 11)]
items = ["Laptop", "Smartphone", "Tablet", "Monitor", "Keyboard", "Mouse", "Printer", "Camera", "Speaker", "Router"]

data = []
order_id_counter = 1000
start_date = datetime(2025, 1, 1)

for _ in range(50):
    customer_id = random.choice(customers)
    item_name = random.choice(items)
    total_amount = random.randint(100, 5000)  # price range from 100 to 5000
    order_date = start_date + timedelta(days=random.randint(0, 365))
    order_id = f"ORD{order_id_counter}"
    order_id_counter += 1
    
    data.append([customer_id, order_id, order_date.strftime('%Y-%m-%d'), item_name, total_amount])

# Define schema for the DataFrame
columns = ["Customer_ID", "Order_ID", "Order_Date", "ItemName", "Total_Amount"]

# Create PySpark DataFrame
fabricofdata_df = spark.createDataFrame(data, columns)
display(fabricofdata_df)

Please try to answer the question and submit your solution in the comments section. If you are still not able to get it, click the show button below to see the answer.



from pyspark.sql.functions import col, collect_set, concat_ws, count

fabricofdata_set = fabricofdata_df.groupBy("Customer_ID").agg(
    count("Order_ID").alias("TotalOrders"),
    collect_set("ItemName").alias("ItemsList")
)
display(fabricofdata_set.withColumn("ItemsList", concat_ws(", ", col("ItemsList"))))
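One caveat: `collect_set` gives no guarantee about element order, so the joined string can differ between runs. In Spark you can make it deterministic by wrapping the array with `sort_array` before joining, e.g. `concat_ws(", ", sort_array(col("ItemsList")))`. The plain-Python sketch below (no Spark session needed; the item sets are hypothetical) shows why sorting before joining stabilizes the output:

```python
# Two "runs" of collect_set: same distinct items, different arrival order.
items_run1 = {"Router", "Laptop", "Mouse"}
items_run2 = {"Mouse", "Router", "Laptop"}

# Sorting before joining (the sort_array idea) yields the same string either way.
assert ", ".join(sorted(items_run1)) == ", ".join(sorted(items_run2))
print(", ".join(sorted(items_run1)))  # Laptop, Mouse, Router
```

This matters if the output feeds a diff, a test, or a downstream join on the string column.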





