Given an employee salary dataset, write a query to perform the following operations:
- Rank employees within each department based on their salary.
- Calculate the average salary for each department.
- Extract the top two highest-paid employees from each department.
- Ensure that ties in salary rankings are handled appropriately.
Input Data:
Expected Output:
Input DataFrame Script
employee_data = [
(1001, 2024, 2, 101, 50000),
(1002, 2024, 4, 101, 60000),
(1003, 2024, 1, 101, 45000),
(1004, 2024, 5, 101, 65000),
(1005, 2024, 3, 101, 55000),
(2001, 2024, 5, 102, 90000),
(2002, 2024, 2, 102, 75000),
(2003, 2024, 3, 102, 80000),
(2004, 2024, 4, 102, 85000),
(2005, 2024, 1, 102, 70000),
(3001, 2024, 2, 103, 55000),
(3002, 2024, 4, 103, 65000),
(3003, 2024, 5, 103, 70000),
(3004, 2024, 3, 103, 60000),
(3005, 2024, 1, 103, 50000),
(4001, 2024, 4, 104, 80000),
(4002, 2024, 5, 104, 85000),
(4003, 2024, 1, 104, 60000),
(4004, 2024, 3, 104, 75000),
(4005, 2024, 2, 104, 70000),
(5001, 2024, 5, 105, 65000),
(5002, 2024, 2, 105, 50000),
(5003, 2024, 3, 105, 55000),
(5004, 2024, 4, 105, 60000),
(5005, 2024, 1, 105, 45000)
]
employee_DF = spark.createDataFrame(employee_data,"EMPID int,FiscalYear int,Rating int,DepartmentID int, Salary int")
display(employee_DF)
department_data = [ ("HR", 101),
("Engineering", 102),
("Sales", 103),
("Finance", 104),
("Marketing", 105)
]
department_DF = spark.createDataFrame(department_data,"Department_Name string,DepartmentID int")
display(department_DF)
Try solving the question yourself! If you need help, click below to reveal the solution.