
Python PySpark Puzzle 5
Output the employee details whose salary is second highest in their department. Note: include all employees tied for second place. …
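A minimal sketch of one way to solve it, using dense_rank() so that everyone tied for the second-highest salary is kept (the column names emp_id, dept, and salary are assumptions for illustration):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: (emp_id, dept, salary)
emps = spark.createDataFrame(
    [(1, "HR", 5000), (2, "HR", 4000), (3, "HR", 4000), (4, "IT", 9000)],
    ["emp_id", "dept", "salary"])

# dense_rank() gives tied salaries the same rank, so every employee
# tied for second place survives the filter below.
w = Window.partitionBy("dept").orderBy(F.desc("salary"))
emps.withColumn("rnk", F.dense_rank().over(w)) \
    .filter(F.col("rnk") == 2) \
    .drop("rnk") \
    .show()
```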
Track customers who made a transaction greater than 50000 exactly on the 10th day after account creation. Use PySpark filtering, date difference functions, and DataFrame operations to solve this real-world query puzzle…
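One possible shape of the solution, assuming hypothetical DataFrames accounts(customer_id, created_date) and txns(customer_id, txn_date, amount):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

accounts = spark.createDataFrame(
    [(1, "2024-01-01"), (2, "2024-02-01")], ["customer_id", "created_date"])
txns = spark.createDataFrame(
    [(1, "2024-01-11", 60000.0), (2, "2024-02-05", 80000.0)],
    ["customer_id", "txn_date", "amount"])

# Keep transactions that land exactly 10 days after account creation
# and exceed the 50000 threshold.
hits = (txns.join(accounts, "customer_id")
            .filter((F.datediff(F.to_date("txn_date"), F.to_date("created_date")) == 10)
                    & (F.col("amount") > 50000)))
hits.show()
```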
Using the provided population dataset, write a PySpark query to find and list all the countries where the female population is greater than the male population. …
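A sketch of the filter, with country, male_population, and female_population as assumed column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

population = spark.createDataFrame(
    [("CountryA", 100, 120), ("CountryB", 150, 140)],
    ["country", "male_population", "female_population"])

population.filter(F.col("female_population") > F.col("male_population")) \
          .select("country") \
          .show()
```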
Using the provided customer orders data, write a PySpark query that calculates the total number of orders placed by each customer. Additionally, in the next column, list all the distinct item names ordered by that customer, separated by…
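One way to express this, assuming hypothetical columns customer_id and item_name; collect_set() gathers the distinct items and concat_ws() joins them into one string:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "apple"), (1, "banana"), (1, "apple"), (2, "cherry")],
    ["customer_id", "item_name"])

# count(*) totals the orders; collect_set() deduplicates item names,
# sort_array() makes the output deterministic, concat_ws() flattens it.
orders.groupBy("customer_id").agg(
    F.count("*").alias("total_orders"),
    F.concat_ws(", ", F.sort_array(F.collect_set("item_name"))).alias("items"),
).show(truncate=False)
```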
PySpark Puzzle: Retrieve Orders Within 50% of Highest Order Amount. For each customer, retrieve every order whose amount is within 50% of that customer's highest order amount. …
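A sketch using a per-customer window maximum (customer_id and amount are assumed names):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, 100.0), (1, 60.0), (1, 40.0), (2, 200.0)],
    ["customer_id", "amount"])

# max() over the whole partition needs no ordering; an order is kept
# when its amount is at least half of the customer's highest amount.
w = Window.partitionBy("customer_id")
(orders.withColumn("max_amt", F.max("amount").over(w))
       .filter(F.col("amount") >= 0.5 * F.col("max_amt"))
       .drop("max_amt")
       .show())
```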
Apache Spark: Difference Between cache() and persist() Explained. A feature-by-feature comparison of cache() and persist() in Apache Spark. …
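In short, cache() takes no arguments and uses Spark's default storage level, while persist() accepts an explicit StorageLevel. A minimal sketch (spark.range() stands in for any real DataFrame):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

df.cache()                           # no arguments: the default storage level is used
df.unpersist()                       # release it before assigning a different level
df.persist(StorageLevel.DISK_ONLY)   # persist() lets you pick the level explicitly
```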
How to Unpersist DataFrames and RDDs in Apache Spark. You can uncache a DataFrame or RDD by using the .unpersist() method. This frees up memory or other storage resources. …
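For example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(100).cache()
df.count()        # the first action materializes the cached blocks
df.unpersist()    # frees memory/disk used by the cache (non-blocking by default)
```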
When to Avoid Caching in Apache Spark. Avoid caching when: the dataset is used only once; memory constraints exist, and caching large datasets may lead to disk spills or OutOfMemory errors; the cost of…
We use caching and persisting to improve performance and reduce execution time. When a dataset is used multiple times, these methods help Spark avoid recomputing the same data repeatedly. This is especially useful in complex transformation pipelines. Whil…
Caching in Spark is a method used to store intermediate results of a DataFrame or RDD in memory, which helps avoid recomputation during future transformations. In PySpark, calling .cache() stores the data u…
In PySpark, cache() and persist() are powerful optimization techniques used with DataFrames, Datasets, and RDDs. These methods allow you to store intermediate results in memory or other storage levels, reducing…
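A sketch of that reuse pattern; the first action materializes the cache, and later actions read the stored blocks instead of recomputing the lineage:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

base = spark.range(10_000_000).withColumn("bucket", F.col("id") % 10)
base.cache()                              # mark the intermediate result for reuse

base.count()                              # action 1: computes and stores the data
base.groupBy("bucket").count().show()     # action 2: served from the cache
```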
Just like the SQL Server 'ORDER BY' clause, PySpark provides the orderBy() and sort() functions to sort data in DataFrames. Since PySpark provides two functions for the same functionality, t…
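In PySpark's DataFrame API the two are interchangeable, since orderBy() is an alias of sort():

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 5000), (2, 4000)], ["emp_id", "salary"])

df.sort("salary").show()                    # ascending by default
df.orderBy(F.col("salary").desc()).show()   # orderBy() is an alias of sort()
```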
Use MEMORY_AND_DISK when the dataset is too large to fit entirely in memory; it prevents re-computation by saving the overflow to disk.
Spark provides several storage levels to help control how data is cached across memory and disk. Each level offers a different trade-off between performance, memory usage, and fault tolerance. Below are the m…
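For instance, picking a level per dataset (a sketch; the sizes are only illustrative):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

hot = spark.range(1_000)          # small enough to live in RAM
big = spark.range(100_000_000)    # likely to overflow executor memory

hot.persist(StorageLevel.MEMORY_ONLY)       # fastest; evicted partitions are recomputed
big.persist(StorageLevel.MEMORY_AND_DISK)   # overflow spills to disk instead of recomputing
```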
When working with PySpark, writing code might come easily, but understanding how Spark handles data behind the scenes is just as important. Specifically, knowing how Spark stores data internally can help you wri…