PySpark – Apache Spark Ecosystem Overview


Table Of Contents:

  1. SparkContext
  2. RDD (Resilient Distributed Dataset)
  3. DataFrame
  4. Spark SQL
  5. SparkSession
  6. MLlib
  7. Spark Streaming / Structured Streaming
  8. GraphX / GraphFrames
  9. Data Sources & Integration
  10. Deployment & Cluster Management
  11. PySpark Libraries

(1) SparkContext

from pyspark import SparkContext
# "local" runs Spark in a single JVM on this machine; "MyApp" is the application name
sc = SparkContext("local", "MyApp")

(2) RDD (Resilient Distributed Dataset)

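rdd = sc.parallelize([1, 2, 3, 4])   # distribute a local list as an RDD
rdd2 = rdd.map(lambda x: x * 2)      # lazy transformation; runs only when an action such as collect() is called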

(3) DataFrame

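from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("App").getOrCreate()
# inferSchema=True parses numeric columns as numbers instead of leaving every column as a string
df = spark.read.csv("data.csv", header=True, inferSchema=True)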

(4) Spark SQL

# Register the DataFrame as a temporary view so SQL can refer to it by name
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > 30").show()

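(5) SparkSession

from pyspark.sql import SparkSession
# getOrCreate() returns the existing session if one is already running
spark = SparkSession.builder.appName("App").getOrCreate()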

(6) MLlib

from pyspark.ml.classification import LogisticRegression
# trainingData must be a DataFrame with a "features" vector column and a numeric "label" column
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(trainingData)

(7) Spark Streaming / Structured Streaming

(8) GraphX / GraphFrames

(12) Summary Table
