PySpark – DataFrames

Table Of Contents:

  1. What Are PySpark DataFrames?
  2. Why Use DataFrames in PySpark?
  3. How to Create DataFrames in PySpark?
  4. Common DataFrame Operations
  5. Lazy Evaluation
  6. Under the Hood: Catalyst & Tungsten

(1) What Are PySpark DataFrames?
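
A PySpark DataFrame is a distributed collection of rows organized into named columns, conceptually similar to a table in a relational database or a pandas DataFrame. Unlike a pandas DataFrame, it is partitioned across the machines of a cluster, carries an explicit schema (column names and types), and is processed in parallel by the Spark engine.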

(2) Why Use DataFrames in PySpark?
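
Compared with the lower-level RDD API, DataFrames let you express operations declaratively, which allows Spark's Catalyst optimizer (section 6) to rearrange and speed up your queries automatically. The attached schema enables compact storage and early detection of type errors, the same API works across many data sources (CSV, JSON, Parquet, JDBC), and a DataFrame can be registered as a temporary view and queried with plain SQL via spark.sql().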

(3) How to Create DataFrames in PySpark?

From A List:
from pyspark.sql import SparkSession

# Entry point to DataFrame functionality; getOrCreate() reuses a running session
spark = SparkSession.builder.appName("Example").getOrCreate()

# Each tuple becomes one row; column names are passed separately
data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()
From A CSV File:
# header=True takes column names from the first line;
# inferSchema=True scans the file to guess each column's type
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df.show()
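
Rather than letting Spark infer types, you can also declare them up front with the standard StructType API; a minimal sketch reusing the data list from above:

With An Explicit Schema:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Each StructField is (column name, type, nullable)
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
])

df = spark.createDataFrame(data, schema)
df.printSchema()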

(4) Common DataFrame Operations

Filtering Rows:
df.filter(df.Age > 30).show()
Selecting Columns:
df.select("Name").show()
Group and Aggregate:
# Assumes Department and Salary columns, e.g. from employees.csv above
df.groupBy("Department").agg({"Salary": "avg"}).show()
Add a New Column:
df = df.withColumn("Bonus", df.Salary * 0.1)
Drop a Column:
df = df.drop("Bonus")
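
Each of these operations returns a new DataFrame, so they chain naturally. A small sketch combining them, assuming the Department and Salary columns from employees.csv (the 50000 threshold is just an illustration):

from pyspark.sql import functions as F

(df.filter(F.col("Salary") > 50000)             # keep well-paid employees
   .withColumn("Bonus", F.col("Salary") * 0.1)  # derive a 10% bonus column
   .select("Name", "Department", "Bonus")       # project the columns of interest
   .show())                                     # action: execute and print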

(5) Lazy Evaluation
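
PySpark evaluates DataFrames lazily: transformations such as filter(), select(), and withColumn() do not touch any data; they only build up a logical plan. Work happens only when an action such as show(), count(), collect(), or write is called, at which point Spark optimizes and executes the whole plan in one job. A minimal sketch using the df from section 3:

# Transformations: nothing runs here, Spark only records the plan
adults = df.filter(df.Age > 30)
names = adults.select("Name")

# Action: triggers reading, filtering, and projecting in a single optimized job
names.show()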

(6) Under the Hood: Catalyst & Tungsten
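
Catalyst is Spark SQL's query optimizer. It takes the logical plan your transformations build and rewrites it, applying rules such as predicate pushdown, column pruning, and constant folding, before choosing a physical plan. Tungsten is the execution layer underneath: it stores rows off-heap in a compact binary format and uses whole-stage code generation to compile stages of the plan into JVM bytecode. You can inspect the result of this optimization yourself:

# Print the physical plan Catalyst chose; explain(True) also shows the logical plans
df.filter(df.Age > 30).select("Name").explain()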
