- 1. What is PySpark, and how does it relate to Apache Spark?
- 2. What are the advantages of using PySpark over other big data tools?
- 3. What is RDD in PySpark, and how does it differ from DataFrames and Datasets?
- 4. What are transformations & actions in PySpark?
- 5. What is a Spark DAG, and how is it related to PySpark job execution?
1. What is PySpark, and how does it relate to Apache Spark?
PySpark is the Python API for Apache Spark, an open-source distributed computing system designed for large-scale data processing. It lets Python developers use Spark's distributed computing capabilities for data analysis, machine learning, and more.
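A minimal sketch of what a PySpark program looks like (data, app name, and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point for any PySpark application; the app name is arbitrary.
spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

# A small in-memory DataFrame (illustrative data).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)

# The query is declared in Python but executed by Spark's distributed engine.
df.filter(F.col("age") > 30).agg(F.avg("age").alias("avg_age")).show()

spark.stop()
```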
2. What are the advantages of using PySpark over other big data tools?
- Python is easy to learn and use, which lowers the barrier to entry for big data work.
- PySpark provides a simple yet comprehensive API.
- Python code tends to be more readable, easier to maintain, and more familiar to most data practitioners.
- It offers a wide range of data visualization options, which are harder to use from Scala or Java (see the sketch below).
Source - https://www.databricks.com/glossary/pyspark
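To illustrate the visualization point, a common pattern is to aggregate in Spark and hand a small result to pandas for plotting. A hedged sketch, assuming pandas and matplotlib are installed (data and column names are illustrative):

```python
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("viz-example").getOrCreate()

# Illustrative data; in practice this would come from spark.read.*.
sales = spark.createDataFrame(
    [("2024-01", 120), ("2024-02", 150), ("2024-03", 90)],
    ["month", "revenue"],
)

# Keep heavy work in Spark, then bring the small result to the driver
# as a pandas DataFrame so the usual Python plotting libraries apply.
pdf = sales.toPandas()
pdf.plot(x="month", y="revenue", kind="bar")
plt.show()
```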
3. What is RDD in PySpark, and how does it differ from DataFrames and Datasets?
| Feature | RDDs | DataFrames | Datasets |
|---|---|---|---|
| Definition | Low-level API for distributed data processing | Higher-level abstraction with schema-based data representation | Combines RDD type safety with DataFrame optimizations |
| Data Structure | Distributed collection of objects | Distributed collection of data with named columns | Strongly-typed distributed collection of objects |
| Type Safety | Yes | No | Yes |
| Schema Support | No | Yes | Yes |
| Optimization | No built-in query optimization | Catalyst query optimizer | Catalyst query optimizer + Tungsten in-memory computation |
| Ease of Use | Requires more code; less intuitive | High-level API; easier and more expressive for structured/semi-structured data | High-level API with type safety and performance benefits |
| Performance | Manual optimization needed | Optimized using Catalyst | Optimized using Catalyst and Tungsten |
| Use Cases | Unstructured data; fine-grained control over transformations/actions | Structured or semi-structured data; simplified queries with SQL-like syntax | Type-safe structured or semi-structured data; scenarios needing both schema enforcement and optimized execution |
| Best For | Advanced users needing low-level control | Users wanting concise, schema-based operations | Users requiring both the simplicity of DataFrames and the robustness of RDDs with type safety |
Source - A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets, which covers the differences and use cases of the three APIs.
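To make the comparison concrete, here is a sketch of the same word count written against the RDD API and the DataFrame API. Note that the typed Dataset API is only available in Scala and Java; from PySpark you work with RDDs and DataFrames. Data and names below are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

lines = ["spark makes big data simple", "pyspark brings spark to python"]

# RDD API: low-level functional transformations, no schema or query optimizer.
rdd_counts = (
    spark.sparkContext.parallelize(lines)
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
print(rdd_counts.collect())

# DataFrame API: named columns and a schema; the query goes through Catalyst.
df = spark.createDataFrame([(l,) for l in lines], ["line"])
df_counts = (
    df.select(F.explode(F.split("line", " ")).alias("word"))
    .groupBy("word")
    .count()
)
df_counts.show()
```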
4. What are transformations & actions in PySpark?
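Transformations (e.g. `map`, `filter`, `flatMap`) describe how to derive a new RDD or DataFrame from an existing one and are evaluated lazily, while actions (e.g. `count`, `collect`, `show`) trigger actual execution and return results to the driver. A minimal sketch with illustrative data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform-vs-action").getOrCreate()

nums = spark.sparkContext.parallelize(range(10))

# Transformations: lazily build up a lineage; nothing runs yet.
evens_squared = nums.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

# Actions: trigger execution of the whole lineage.
print(evens_squared.count())    # 5
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
```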
5. What is a Spark DAG, and how is it related to PySpark job execution?
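A hedged sketch of how this shows up in practice: chained transformations only build up the plan that Spark later turns into a DAG of stages, and `explain()` lets you inspect that plan before any action runs (query below is illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-example").getOrCreate()

df = spark.range(1_000_000)

# Chained transformations only describe the computation; no job runs yet.
result = df.withColumn("bucket", F.col("id") % 10).groupBy("bucket").count()

# Print the logical and physical plans Spark derived (still no execution).
result.explain(True)

# Only an action such as show() submits the job, executed as a DAG of stages.
result.show()
```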
- Explain the concept of lazy evaluation in PySpark.
- How does PySpark handle parallelism and partitioning?
- What is the difference between map() & flatMap() transformations?
- How do you create a DataFrame in PySpark? List different ways.
- Explain the concept of SparkSession.