PySpark – DataFrames
Table Of Contents:
- What Are PySpark DataFrames?
- Why Use DataFrames In PySpark?
- How To Create DataFrames In PySpark?
- Common DataFrame Operations
- Lazy Evaluation
- Under the Hood: Catalyst & Tungsten
(1) What Are PySpark DataFrames?
A PySpark DataFrame is a distributed collection of rows organized into named columns, much like a table in a relational database or a pandas DataFrame. Every DataFrame carries a schema recording each column's name and data type, and its data is partitioned across the machines in the cluster, so the same API works on a few rows or a few billion.
(2) Why Use DataFrames in PySpark?
DataFrames offer several advantages over working with raw RDDs:
- Spark knows the schema, so the Catalyst optimizer and the Tungsten engine (section 6) can plan and execute queries efficiently on your behalf.
- They expose familiar, SQL-like operations: select, filter, groupBy, joins.
- The same code runs unchanged on a laptop or on a large cluster.
(3) How to Create a DataFrame?
From A List:
from pyspark.sql import SparkSession

# A SparkSession is the entry point for all DataFrame work
spark = SparkSession.builder.appName("Example").getOrCreate()

# Build a DataFrame from an in-memory list of tuples
data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
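If you want fixed column types rather than letting Spark infer them from the Python values, you can pass an explicit schema instead of a column list. A minimal sketch, reusing the data list above:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare each column's name, type, and nullability up front
schema = StructType([
    StructField("Name", StringType(), False),
    StructField("Age", IntegerType(), True),
])
df = spark.createDataFrame(data, schema)
df.printSchema()  # Name: string (non-nullable), Age: integer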
From A CSV File:
# header=True reads column names from the first row;
# inferSchema=True samples the file to guess each column's type
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df.show()
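Because inferSchema only guesses from the data, it is worth confirming the types Spark chose:

# Print the inferred column names and types
df.printSchema()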
(4) Common DataFrame Operations
Filtering Rows:
df.filter(df.Age > 30).show()
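Conditions can be combined with & (and) and | (or), wrapping each side in parentheses. A sketch, assuming employees.csv has the Department column used below and a hypothetical value "Sales":

from pyspark.sql.functions import col

# Each condition needs its own parentheses when combined with & or |
df.filter((col("Age") > 30) & (col("Department") == "Sales")).show()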
Selecting Columns:
df.select("Name").show()
Group and Aggregate:
df.groupBy("Department").agg({"Salary": "avg"}).show()
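The dictionary form produces an auto-generated column name such as avg(Salary); the pyspark.sql.functions API lets you name the result yourself:

from pyspark.sql.functions import avg

# Same aggregation, but with an explicit output column name
df.groupBy("Department").agg(avg("Salary").alias("AvgSalary")).show()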
Add a New Column:
df = df.withColumn("Bonus", df.Salary * 0.1)
Drop a Column:
df = df.drop("Bonus")
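Because every transformation returns a new DataFrame, these operations chain naturally into a pipeline. A small sketch combining the steps above:

from pyspark.sql.functions import col

# Each call returns a new DataFrame, so the steps compose
result = (
    df.withColumn("Bonus", col("Salary") * 0.1)
      .filter(col("Age") > 30)
      .select("Name", "Salary", "Bonus")
)
result.show()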
(5) Lazy Evaluation
Transformations such as filter, select, withColumn, and groupBy are lazy: they only record what should happen, building up an execution plan. Nothing is computed until you call an action such as show(), count(), or collect(), at which point Spark optimizes the whole plan and runs it in one pass.
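A small sketch using the df from above; the first two lines only describe the computation, and no data moves until show() runs:

# Transformations are lazy: these lines only build a plan
adults = df.filter(df.Age > 30)
names = adults.select("Name")

# An action triggers execution of the accumulated plan
names.show()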
(6) Under the Hood: Catalyst & Tungsten
Catalyst is Spark SQL's query optimizer: it takes the logical plan your DataFrame code builds and rewrites it, for example by pushing filters closer to the data source and pruning unused columns, before choosing a physical plan. Tungsten is the execution layer underneath: it stores rows in a compact binary format, manages memory off-heap, and generates JVM bytecode for hot code paths. Together they are the main reason DataFrame code usually outperforms equivalent hand-written RDD code.
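You can watch Catalyst at work: explain(True) prints the parsed, analyzed, optimized, and physical plans for a query:

# explain(True) shows how Catalyst rewrites the query before it runs
df.filter(df.Age > 30).select("Name").explain(True)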

