PySpark - Basics


10 Oct 2024  Amey Kolhe  2 mins read.


1. What is PySpark, how does it relate to Apache Spark?


PySpark is the Python API for Apache Spark, an open-source distributed computing system designed for large-scale data processing. PySpark allows Python developers to make use of the distributed computing capabilities of Spark to perform data analysis, machine learning, and more.
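
For a sense of what this looks like in practice, here is a minimal sketch: create a SparkSession, build a small DataFrame, and run a distributed aggregation (the app name and sample data are illustrative).

```python
from pyspark.sql import SparkSession

# The SparkSession is the entry point to PySpark (illustrative app name).
spark = SparkSession.builder.appName("pyspark-basics").getOrCreate()

# Build a small DataFrame from in-memory data and run a distributed aggregation.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.groupBy().avg("age").show()

spark.stop()
```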


2. What are the advantages of using PySpark over other big data tools?


  • Python is easy to learn and use.
  • PySpark provides a simple yet comprehensive API.
  • Python code is generally more readable and easier to maintain, and more developers are familiar with it.
  • Python offers a wide range of data visualization options, which are harder to come by in Scala or Java.

Source - https://www.databricks.com/glossary/pyspark


3. What is RDD in PySpark, how does it differ from DataFrames and Dataset?


| Feature | RDDs | DataFrames | Datasets |
|---|---|---|---|
| Definition | Low-level API for distributed data processing | Higher-level abstraction with schema-based data representation | Combines RDD type safety and DataFrame optimizations |
| Data Structure | Distributed collection of objects | Distributed collection of data with named columns | Strongly-typed distributed collection of objects |
| Type Safety | Yes | No | Yes |
| Schema Support | No | Yes | Yes |
| Optimization | No built-in query optimization | Catalyst query optimizer | Catalyst query optimizer + Tungsten in-memory computation |
| Ease of Use | Requires more code; less intuitive | High-level API; easier and more expressive for structured/semi-structured data | High-level API with type safety and performance benefits |
| Performance | Manual optimization needed | Optimized using Catalyst | Optimized using Catalyst and Tungsten |
| Use Cases | Unstructured data; fine-grained control over transformations/actions | Structured or semi-structured data; simplified queries with SQL-like syntax | Type-safe structured or semi-structured data; scenarios needing both schema enforcement and optimized execution |
| Best For | Advanced users needing low-level control | Users wanting concise and schema-based operations | Users requiring both the simplicity of DataFrames and the robustness of RDDs with type safety |

Source - A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets, which discusses the key differences and use cases of RDDs, DataFrames, and Datasets.
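
As a rough illustration of the difference in practice, here is a small sketch contrasting the RDD and DataFrame APIs on the same data (the names and values are made up). Note that the typed Dataset API is available only in Scala and Java, so PySpark code works with RDDs and DataFrames.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD: a low-level distributed collection of plain Python objects.
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 45)])
adults_rdd = rdd.filter(lambda row: row[1] >= 40)   # positional access, no schema
print(adults_rdd.collect())

# DataFrame: the same data with named columns and Catalyst optimization.
df = spark.createDataFrame(rdd, ["name", "age"])
df.filter(df.age >= 40).show()                      # column-based, SQL-like access

spark.stop()
```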


4. What are transformations & actions in PySpark?
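

Transformations (for example map(), filter(), select()) are lazy: they return a new RDD or DataFrame and only record the operation in the lineage graph. Actions (for example collect(), count(), show(), or writing output) trigger actual execution of that lineage and return a result to the driver or write data out. A minimal sketch of the distinction (sample data is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-vs-actions").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))

# Transformations are lazy -- nothing runs yet, Spark just records the lineage.
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger execution across the partitions and return results.
print(evens.collect())  # [0, 4, 16, 36, 64]
print(evens.count())    # 5

spark.stop()
```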




  • Explain the concept of lazy evaluation in PySpark.
  • How does PySpark handle parallelism and partitioning?
  • What is the difference between map() & flatMap() transformations?
  • How do you create a DataFrame in PySpark? List different ways.
  • Explain the concept of SparkSession.