PySpark: Apache Spark Ecosystem Overview
Table of Contents:
- SparkContext
- RDD (Resilient Distributed Dataset)
- DataFrame
- Spark SQL
- SparkSession
- MLlib
- Spark Streaming / Structured Streaming
- GraphX / GraphFrames
- Data Sources & Integration
- Deployment & Cluster Management
- PySpark Libraries
(1) SparkContext
from pyspark import SparkContext
# "local" runs Spark in-process with one worker thread; "MyApp" is the application name.
sc = SparkContext("local", "MyApp")
(2) RDD (Resilient Distributed Dataset)
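# Distribute a local Python list as an RDD.
rdd = sc.parallelize([1, 2, 3, 4])
# map is a lazy transformation; nothing runs until an action such as rdd2.collect() is called.
rdd2 = rdd.map(lambda x: x * 2)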
(3) DataFrame
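from pyspark.sql import SparkSession
# getOrCreate() returns the active session if one exists, otherwise builds a new one.
spark = SparkSession.builder.appName("App").getOrCreate()
# header=True takes column names from the first row; without inferSchema=True every column is read as a string.
df = spark.read.csv("data.csv", header=True)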
(4) Spark SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > 30").show()
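(5) SparkSession
# SparkSession (Spark 2.0+) is the unified entry point; the underlying SparkContext is available as spark.sparkContext.
spark = SparkSession.builder.appName("App").getOrCreate()
(6) MLlib
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol="features", labelCol="label")
# trainingData is assumed to be a DataFrame with a "features" vector column and a numeric "label" column.
model = lr.fit(trainingData)
(7) Spark Streaming / Structured Streaming
(8) GraphX / GraphFrames
(9) Summary Table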

