• PySpark – PySpark Graphs

    PySpark – PySpark GraphX / GraphFrames

    Table Of Contents:
    What is a Graph in PySpark?
    Example Of PySpark Graph
    Why Use Graphs in PySpark?
    Where Is The PySpark Graph Used In Real Life?

    (1) What is a Graph in PySpark?
    (2) Example Of PySpark Graph

        from graphframes import GraphFrame

        g = GraphFrame(vertices, edges)

    (3) Why Use Graphs in PySpark?
    (4) PySpark Real Life Examples
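
    The excerpt does not show how the vertices and edges passed to GraphFrame are built. A minimal sketch, assuming the optional graphframes package is installed alongside PySpark: a GraphFrame expects a vertices DataFrame with an "id" column and an edges DataFrame with "src" and "dst" columns; the names and relationships below are made up for illustration.

        from pyspark.sql import SparkSession
        from graphframes import GraphFrame  # requires the graphframes Spark package

        spark = SparkSession.builder.appName("GraphExample").getOrCreate()

        # Vertices need an "id" column; extra columns become vertex attributes.
        vertices = spark.createDataFrame(
            [("a", "Alice"), ("b", "Bob"), ("c", "Charlie")],
            ["id", "name"])

        # Edges need "src" and "dst" columns referencing vertex ids.
        edges = spark.createDataFrame(
            [("a", "b", "follows"), ("b", "c", "follows")],
            ["src", "dst", "relationship"])

        g = GraphFrame(vertices, edges)
        g.inDegrees.show()  # number of incoming edges per vertex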

    Read More

  • PySpark – PySpark Streaming

    PySpark – PySpark Streaming

    Table Of Contents:
    What is Spark Streaming?
    What is Structured Streaming?
    Key Concepts
    Example Code
    Spark Streaming vs. Structured Streaming
    Use Cases

    (1) What Is Spark Streaming?
    (2) What Is Structured Streaming?
    (3) Key Concepts:
    (4) Example Code

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

        # Read stream from a socket source
        df = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

        # Word count logic
        words = df.selectExpr("explode(split(value, ' ')) as word")
        word_counts = words.groupBy("word").count()

        # Write the results to the console
        query = word_counts.writeStream.outputMode("complete").format("console").start()
        query.awaitTermination()

    (5) Spark Streaming vs. Structured Streaming
    (6) Use Cases
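
    For the comparison in section (5), a minimal sketch of the same word count written with the older DStream (Spark Streaming) API; StreamingContext, the batch interval, and socketTextStream are the usual entry points there, though this API is deprecated in recent Spark releases and may be unavailable in the newest ones.

        from pyspark import SparkContext
        from pyspark.streaming import StreamingContext

        sc = SparkContext("local[2]", "DStreamWordCount")
        ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

        lines = ssc.socketTextStream("localhost", 9999)
        words = lines.flatMap(lambda line: line.split(" "))
        counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
        counts.pprint()  # print each batch's counts to the console

        ssc.start()
        ssc.awaitTermination()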

    Read More

  • PySpark – PySpark MLlib

    PySpark – PySpark MLlib

    Table Of Contents:
    What is PySpark MLlib?
    Two APIs in MLlib
    Why Use MLlib?
    Key Features
    Example ML Pipeline (End-to-End)
    Commonly Used Classes
    When to Use PySpark MLlib?

    (1) What Is PySpark MLlib?
    (2) Two APIs in MLlib
    (3) Why Use MLlib?
    (4) Key Features
    (5) What is a PySpark Pipeline?

        model = pipeline.fit(data)
        model.transform(data)

        from pyspark.ml import Pipeline
        from pyspark.ml.feature import StringIndexer, VectorAssembler
        from pyspark.ml.classification import LogisticRegression

        # Step 1: Convert label to numeric
        indexer = StringIndexer(inputCol="purchased", outputCol="label")

        # Step 2: Assemble features
        assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")

        # Step 3: Model
        lr = LogisticRegression(featuresCol="features",
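
    The excerpt is cut off in the middle of the pipeline; a minimal end-to-end sketch under the same assumptions (a DataFrame with hypothetical "age", "salary", and "purchased" columns) showing how the stages are chained, fit, and applied.

        from pyspark.sql import SparkSession
        from pyspark.ml import Pipeline
        from pyspark.ml.feature import StringIndexer, VectorAssembler
        from pyspark.ml.classification import LogisticRegression

        spark = SparkSession.builder.appName("MLlibPipelineSketch").getOrCreate()

        # Hypothetical training data: age, salary, and whether the customer purchased.
        data = spark.createDataFrame(
            [(25, 30000.0, "no"), (40, 80000.0, "yes"), (35, 60000.0, "yes"), (22, 25000.0, "no")],
            ["age", "salary", "purchased"])

        indexer = StringIndexer(inputCol="purchased", outputCol="label")                # label to numeric
        assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")  # feature vector
        lr = LogisticRegression(featuresCol="features", labelCol="label")

        pipeline = Pipeline(stages=[indexer, assembler, lr])
        model = pipeline.fit(data)  # fits every stage in order
        model.transform(data).select("age", "salary", "prediction").show()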

    Read More

  • PySpark – PySpark SQL

    PySpark – PySpark SQL

    Table Of Contents:
    What is PySpark SQL?
    Why Use PySpark SQL?
    Setting It Up (Step-by-Step)
    SQL vs DataFrame APIs (Both Supported!)
    Advanced Features in PySpark SQL
    Input Data Formats
    Performance Optimizations
    Real-World Use Cases
    Summary

    (1) What is PySpark SQL?
    (2) Why Use PySpark SQL?
    (3) Setting It Up (Step-by-Step)

    Step 1: Create a SparkSession

        from pyspark.sql import SparkSession

        spark = SparkSession.builder \
            .appName("PySparkSQLDemo") \
            .getOrCreate()

    Step 2: Load Data into a DataFrame

        df = spark.read.csv("employees.csv", header=True, inferSchema=True)
        df.show()

    Step 3: Register DataFrame as SQL Table (Temp View)

        df.createOrReplaceTempView("employees")

    Step 4: Run SQL Queries!

        result = spark.sql(""" SELECT
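
    The excerpt breaks off inside the query string. A minimal sketch of what Step 4 could look like, continuing from the steps above; the "department" and "salary" columns and the grouping are hypothetical choices for the employees view.

        # Assumes the "employees" temp view registered in Step 3; column names are hypothetical.
        result = spark.sql("""
            SELECT department, AVG(salary) AS avg_salary
            FROM employees
            GROUP BY department
            ORDER BY avg_salary DESC
        """)
        result.show()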

    Read More

  • PySpark – DataFrames

    PySpark – DataFrames

    Table Of Contents:
    What Are PySpark DataFrames?
    Why Use DataFrames In PySpark?
    How To Create DataFrames In PySpark?
    Common DataFrame Operations
    Lazy Evaluation
    Under the Hood: Catalyst & Tungsten

    (1) What Are PySpark DataFrames?
    (2) Why Use DataFrames in PySpark?
    (3) How to Create a DataFrame?

    From A List:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("Example").getOrCreate()

        data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
        columns = ["Name", "Age"]
        df = spark.createDataFrame(data, columns)
        df.show()

    From A CSV File:

        df = spark.read.csv("employees.csv", header=True, inferSchema=True)
        df.show()

    (4) Common DataFrame Operations

    Filtering Rows:

        df.filter(df.Age > 30).show()

    Selecting Columns:

        df.select("Name").show()

    Group and
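
    The excerpt ends just as grouping is introduced. A minimal sketch of a typical group-and-aggregate on the df created above; grouping by "Name" is only for illustration, since with real data you would group by a categorical column.

        from pyspark.sql import functions as F

        # Row count and average age per group.
        df.groupBy("Name").agg(
            F.count(F.lit(1)).alias("n"),
            F.avg("Age").alias("avg_age")
        ).show()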

    Read More

  • PySpark – Spark Application Lifecycle Overview

    PySpark – Spark Application Lifecycle Overview

    Table Of Contents:
    Spark Application Starts
    Driver Program Is Launched
    Cluster Manager Allocates Resources
    Job is Created on Action
    DAG Scheduler Breaks Job into Stages
    Tasks are Sent to Executors
    Results Returned to Driver
    SparkContext Stops / Application Ends

    (1) Spark Application Start

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("MyApp").getOrCreate()

    We first initialize a Spark application to enable distributed data processing with Apache Spark; creating the SparkSession is the entry point for using Spark.

    (2) Driver Program Is Launched

        from pyspark.sql import SparkSession

        # This runs on the
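
    To make the "Job is Created on Action" step concrete, a minimal sketch: the transformations below are only recorded, and the action at the end is what triggers the driver to submit a job that the DAG scheduler splits into stages and tasks for the executors. The numbers and partition count are arbitrary.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("LifecycleDemo").getOrCreate()

        rdd = spark.sparkContext.parallelize(range(1, 1001), numSlices=4)  # 4 partitions -> 4 tasks
        doubled = rdd.map(lambda x: x * 2)            # transformation: nothing runs yet
        evens = doubled.filter(lambda x: x % 4 == 0)  # still lazy

        total = evens.sum()  # action: a job is submitted and the result comes back to the driver
        print(total)

        spark.stop()  # SparkContext stops / application ends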

    Read More

  • PySpark – Apache PySpark Ecosystem Overview

    PySpark – Apache PySpark Ecosystem Overview

    Table Of Contents:
    SparkContext
    RDD (Resilient Distributed Dataset)
    DataFrame
    Spark SQL
    SparkSession
    MLlib
    Spark Streaming / Structured Streaming
    GraphX / GraphFrames
    Data Sources & Integration
    Deployment & Cluster Management
    PySpark Libraries

    (1) SparkContext

        from pyspark import SparkContext

        sc = SparkContext("local", "MyApp")

    (2) RDD (Resilient Distributed Dataset)

        rdd = sc.parallelize([1, 2, 3, 4])
        rdd2 = rdd.map(lambda x: x * 2)

    (3) DataFrame

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("App").getOrCreate()
        df = spark.read.csv("data.csv", header=True)

    (4) Spark SQL

        df.createOrReplaceTempView("people")
        spark.sql("SELECT * FROM people WHERE age > 30").show()

    (5) SparkSession

        spark = SparkSession.builder.appName("App").getOrCreate()

    (6) MLlib

        from pyspark.ml.classification
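
    The excerpt stops at the MLlib import, before the later ecosystem components. As one more example, a minimal sketch of the "Data Sources & Integration" piece listed in the table of contents, reading and writing a couple of common formats; the file paths and the "date" partition column are placeholders.

        # Reuses the SparkSession created above; paths and columns are placeholders.
        df_json = spark.read.json("events.json")                    # JSON source
        df_parquet = spark.read.parquet("warehouse/users.parquet")  # Parquet source

        # Write a DataFrame back out in Parquet, partitioned by a (hypothetical) "date" column.
        df_json.write.mode("overwrite").partitionBy("date").parquet("output/events_parquet")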

    Read More

  • PySpark – PySpark Vs Pandas Vs Dask

    PySpark – PySpark Vs Pandas Vs Dask

    Table Of Contents:
    PySpark Vs Pandas Vs Dask
    Use Case-Based Comparison
    Summary

    (1) PySpark Vs Pandas Vs Dask
    (2) Use Case-Based Comparison
    (3) Summary
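
    The excerpt lists the comparison topics without code. A minimal sketch of the same group-by average written against each library, the kind of side-by-side the "Use Case-Based Comparison" section suggests; the sales.csv file and its "region" and "amount" columns are made up.

        # Pandas: eager, single machine, in memory
        import pandas as pd
        pdf = pd.read_csv("sales.csv")
        print(pdf.groupby("region")["amount"].mean())

        # Dask: lazy, parallel on one machine or a cluster; .compute() triggers execution
        import dask.dataframe as dd
        ddf = dd.read_csv("sales.csv")
        print(ddf.groupby("region")["amount"].mean().compute())

        # PySpark: lazy and distributed; the .show() action triggers execution
        from pyspark.sql import SparkSession, functions as F
        spark = SparkSession.builder.appName("Comparison").getOrCreate()
        sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)
        sdf.groupBy("region").agg(F.avg("amount").alias("avg_amount")).show()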

    Read More

  • PySpark – Why Use PySpark Over Python?

    PySpark – Why Use PySpark Over Python?

    Table Of Contents:
    Why Use PySpark Over Python?
    Distributed Computing
    Big Data Support
    Lazy Evaluation
    Built-In Fault Tolerance
    Support For SQL, ML, Streaming and Graphs
    Cluster Deployment
    Optimized Engine

    (1) Why Use PySpark Over Python?
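
    The excerpt lists the reasons without examples. A minimal sketch contrasting plain, eager, single-process Python with PySpark's lazy, distributed evaluation of the same computation; the numbers.txt file and the threshold are arbitrary.

        # Plain Python: the whole file is read into one process's memory, eagerly.
        with open("numbers.txt") as f:
            values = [int(line) for line in f]
        print(sum(v * 2 for v in values if v > 10))

        # PySpark: the same logic is recorded lazily and only runs, in parallel
        # across partitions, when the sum() action is called.
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("WhyPySpark").getOrCreate()

        rdd = spark.sparkContext.textFile("numbers.txt")
        result = (rdd.map(lambda s: int(s))
                     .filter(lambda v: v > 10)
                     .map(lambda v: v * 2)
                     .sum())  # action triggers the distributed job
        print(result)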

    Read More

  • PySpark – What Is PySpark?

    PySpark – What Is PySpark?

    Table Of Contents:
    What Is PySpark?
    What Is Distributed Computing?
    If I Have Only A Single Computer, How Will The Task Get Distributed?
    How Does Spark Work On A Single-Core Device?

    (1) What Is PySpark?
    (2) What Is Distributed Computing?
    (3) If I Have Only A Single Computer, How Will The Task Get Distributed?
    (4) How Does Spark Work On A Single-Core Device?
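
    Question (4) can be shown directly: with a local master, Spark runs the driver and its worker threads inside a single process on one machine, so the same code works there and on a cluster. A minimal sketch; the master string, data, and thread count are arbitrary.

        from pyspark.sql import SparkSession

        # local[2] runs Spark on this machine with 2 worker threads;
        # local[*] would use every available core. No cluster is needed.
        spark = (SparkSession.builder
                 .master("local[2]")
                 .appName("SingleMachineDemo")
                 .getOrCreate())

        rdd = spark.sparkContext.parallelize(range(10), numSlices=2)
        print(rdd.map(lambda x: x * x).collect())  # work is split across the 2 threads

        spark.stop()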

    Read More