Python Pyspark Puzzle 2


Using the provided customer orders data, write a PySpark query that calculates the total number of orders placed by each customer. Additionally, in the next column, list all the distinct item names ordered by that customer, separated by commas (e.g., "Laptop, Mobile, Pendrive").
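To make the expected shape of the result concrete before running any Spark code, here is a tiny plain-Python sketch of the same logic on a few hypothetical rows (the customer IDs and items below are made up for illustration; they are not the generated dataset):

```python
# Hypothetical sample rows: (Customer_ID, Order_ID, ItemName)
rows = [
    ("CUST001", "ORD1000", "Laptop"),
    ("CUST001", "ORD1001", "Mobile"),
    ("CUST001", "ORD1002", "Laptop"),   # repeated item counts as one distinct entry
    ("CUST002", "ORD1003", "Pendrive"),
]

# Accumulate per customer: a set of order IDs and a set of distinct items
result = {}
for customer, order, item in rows:
    orders, items = result.setdefault(customer, (set(), set()))
    orders.add(order)
    items.add(item)

# Total orders plus the distinct items joined by commas
for customer, (orders, items) in sorted(result.items()):
    print(customer, len(orders), ", ".join(sorted(items)))
# CUST001 3 Laptop, Mobile
# CUST002 1 Pendrive
```

Note that repeated items ("Laptop" twice for CUST001) still add to the order count but appear only once in the item list, which is exactly the distinction between counting orders and collecting distinct item names.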


Input Data:


Expected Output:


Input DataFrame Script:
import random
from datetime import datetime, timedelta

# Generate the sample data
customers = [f"CUST{str(i).zfill(3)}" for i in range(1, 11)]
items = ["Laptop", "Smartphone", "Tablet", "Monitor", "Keyboard", "Mouse", "Printer", "Camera", "Speaker", "Router"]

data = []
order_id_counter = 1000
start_date = datetime(2025, 1, 1)

for _ in range(50):
    customer_id = random.choice(customers)
    item_name = random.choice(items)
    total_amount = random.randint(100, 5000)  # price range from 100 to 5000
    order_date = start_date + timedelta(days=random.randint(0, 365))
    order_id = f"ORD{order_id_counter}"
    order_id_counter += 1
    
    data.append([customer_id, order_id, order_date.strftime('%Y-%m-%d'), item_name, total_amount])

# Define schema for the DataFrame
columns = ["Customer_ID", "Order_ID", "Order_Date", "ItemName", "Total_Amount"]

# Create PySpark DataFrame
fabricofdata_df = spark.createDataFrame(data, columns)
display(fabricofdata_df)

Please try to answer the question and submit your solution in the comments section. If you are still not able to get it, click the show button below to see the answer.



from pyspark.sql.functions import col, collect_set, concat_ws, count

fabricofdata_set = fabricofdata_df.groupBy("Customer_ID").agg(
    count("Order_ID").alias("TotalOrders"),
    collect_set("ItemName").alias("ItemsList")
)
display(fabricofdata_set.withColumn("ItemsList", concat_ws(", ", col("ItemsList"))))
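One caveat: `collect_set` gives no guarantee about element order, so the joined string can differ between runs. In Spark you can make it deterministic by wrapping the array with `sort_array` before joining, e.g. `concat_ws(", ", sort_array(col("ItemsList")))`. The plain-Python sketch below (no Spark session needed; the item sets are hypothetical) shows why sorting before joining stabilizes the output:

```python
# Two "runs" of collect_set: same distinct items, different arrival order.
items_run1 = {"Router", "Laptop", "Mouse"}
items_run2 = {"Mouse", "Router", "Laptop"}

# Sorting before joining (the sort_array idea) yields the same string either way.
assert ", ".join(sorted(items_run1)) == ", ".join(sorted(items_run2))
print(", ".join(sorted(items_run1)))  # Laptop, Mouse, Router
```

This matters if the output feeds a diff, a test, or a downstream join on the string column.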





