Python PySpark Puzzle 9 - Identify Frequent Customers

SAS
0

Identify frequent customers who had ≥5 service appointments in specific service types (‘Engine Repair’, ‘Transmission Check’). Calculate their total service cost and average days between appointments. Exclude null amounts, handle duplicate appointments, and ensure valid service dates. Return customer_id, total_service_cost, avg_days_between, ordered by total_service_cost in descending order.


Input Data:

Expected Output:

Input DataFrame Script

import random
from datetime import datetime, timedelta

service_types = ["Engine Repair", "Transmission Check", "Brake Inspection", "Oil Change", "Tire Rotation"]
customer_ids = range(1001, 1051)  # 50 customers
appointment_ids = range(1, 501)  # 500 appointments

# Generate random service appointment data
service_appointments_data = [
    (appointment_id,
     random.choice(customer_ids),
     (datetime(2025, 1, 1) + timedelta(days=random.randint(0, 90))).strftime("%Y-%m-%d"),
     random.choice(service_types))
    for appointment_id in appointment_ids
]

service_appointments_columns = ["appointment_id", "customer_id", "service_date", "service_type"]
service_appointments_df = spark.createDataFrame(service_appointments_data, service_appointments_columns)
service_appointments_df.show()


# Generate random billing data corresponding to appointments
billing_ids = range(501, 1001)  # 500 billing records

billing_data = [
    (bill_id,
     appointment_id,
     (datetime(2024, 1, 2) + timedelta(days=random.randint(0, 90))).strftime("%Y-%m-%d"),
     random.randint(5000, 20000))  # Random billing amount between 5000-20000
    for bill_id, appointment_id in zip(billing_ids, appointment_ids)
]

billing_columns = ["bill_id", "appointment_id", "bill_date", "bill_amount"]
billing_df = spark.createDataFrame(billing_data, billing_columns)
billing_df.show()

Try solving the question yourself! If you need help, click below to reveal the solution.





Post a Comment

0Comments

Post a Comment (0)