• PySpark – PySpark Graphs

    PySpark – PySpark GraphX / GraphFrames

    Table Of Contents:
    What is a Graph in PySpark?
    Example Of PySpark Graph
    Why Use Graphs in PySpark?
    Where Is The PySpark Graph Used In Real Life?

    (1) What is a Graph in PySpark?
    (2) Example Of PySpark Graph

        from graphframes import GraphFrame

        g = GraphFrame(vertices, edges)

    (3) Why Use Graphs in PySpark?
    (4) PySpark Real Life Examples
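
    The excerpt does not show how the vertices and edges passed to GraphFrame are built. A minimal sketch, assuming the optional graphframes package is installed alongside PySpark: a GraphFrame expects a vertices DataFrame with an "id" column and an edges DataFrame with "src" and "dst" columns; the names and relationships below are made up for illustration.

        from pyspark.sql import SparkSession
        from graphframes import GraphFrame  # requires the graphframes Spark package

        spark = SparkSession.builder.appName("GraphExample").getOrCreate()

        # Vertices need an "id" column; extra columns become vertex attributes.
        vertices = spark.createDataFrame(
            [("a", "Alice"), ("b", "Bob"), ("c", "Charlie")],
            ["id", "name"])

        # Edges need "src" and "dst" columns referencing vertex ids.
        edges = spark.createDataFrame(
            [("a", "b", "follows"), ("b", "c", "follows")],
            ["src", "dst", "relationship"])

        g = GraphFrame(vertices, edges)
        g.inDegrees.show()  # number of incoming edges per vertex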

    Read More

  • PySpark – PySpark Streaming

    PySpark – PySpark Streaming

    Table Of Contents:
    What is Spark Streaming?
    What is Structured Streaming?
    Key Concepts
    Example Code
    Spark Streaming vs. Structured Streaming
    Use Cases

    (1) What Is Spark Streaming?
    (2) What Is Structured Streaming?
    (3) Key Concepts:
    (4) Example Code

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

        # Read stream from a socket source
        df = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

        # Word count logic
        words = df.selectExpr("explode(split(value, ' ')) as word")
        word_counts = words.groupBy("word").count()

        # Write the results to the console
        query = word_counts.writeStream.outputMode("complete").format("console").start()
        query.awaitTermination()

    (5) Spark Streaming vs. Structured Streaming
    (6) Use Cases
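
    For the comparison in section (5), a minimal sketch of the same word count written with the older DStream (Spark Streaming) API; StreamingContext, the batch interval, and socketTextStream are the usual entry points there, though this API is deprecated in recent Spark releases and may be unavailable in the newest ones.

        from pyspark import SparkContext
        from pyspark.streaming import StreamingContext

        sc = SparkContext("local[2]", "DStreamWordCount")
        ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

        lines = ssc.socketTextStream("localhost", 9999)
        words = lines.flatMap(lambda line: line.split(" "))
        counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
        counts.pprint()  # print each batch's counts to the console

        ssc.start()
        ssc.awaitTermination()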

    Read More

  • PySpark – PySpark MLlib

    PySpark – PySpark MLlib

    Table Of Contents:
    What is PySpark MLlib?
    Two APIs in MLlib
    Why Use MLlib?
    Key Features
    Example ML Pipeline (End-to-End)
    Commonly Used Classes
    When to Use PySpark MLlib?

    (1) What Is PySpark MLlib?
    (2) Two APIs in MLlib
    (3) Why Use MLlib?
    (4) Key Features
    (5) What is a PySpark Pipeline?

        model = pipeline.fit(data)
        model.transform(data)

        from pyspark.ml import Pipeline
        from pyspark.ml.feature import StringIndexer, VectorAssembler
        from pyspark.ml.classification import LogisticRegression

        # Step 1: Convert label to numeric
        indexer = StringIndexer(inputCol="purchased", outputCol="label")

        # Step 2: Assemble features
        assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")

        # Step 3: Model
        lr = LogisticRegression(featuresCol="features",
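
    The excerpt is cut off in the middle of the pipeline; a minimal end-to-end sketch under the same assumptions (a DataFrame with hypothetical "age", "salary", and "purchased" columns) showing how the stages are chained, fit, and applied.

        from pyspark.sql import SparkSession
        from pyspark.ml import Pipeline
        from pyspark.ml.feature import StringIndexer, VectorAssembler
        from pyspark.ml.classification import LogisticRegression

        spark = SparkSession.builder.appName("MLlibPipelineSketch").getOrCreate()

        # Hypothetical training data: age, salary, and whether the customer purchased.
        data = spark.createDataFrame(
            [(25, 30000.0, "no"), (40, 80000.0, "yes"), (35, 60000.0, "yes"), (22, 25000.0, "no")],
            ["age", "salary", "purchased"])

        indexer = StringIndexer(inputCol="purchased", outputCol="label")                # label to numeric
        assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")  # feature vector
        lr = LogisticRegression(featuresCol="features", labelCol="label")

        pipeline = Pipeline(stages=[indexer, assembler, lr])
        model = pipeline.fit(data)  # fits every stage in order
        model.transform(data).select("age", "salary", "prediction").show()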

    Read More

  • PySpark – PySpark SQL

    PySpark – PySpark SQL

    Table Of Contents:
    What is PySpark SQL?
    Why Use PySpark SQL?
    Setting It Up (Step-by-Step)
    SQL vs DataFrame APIs (Both Supported!)
    Advanced Features in PySpark SQL
    Input Data Formats
    Performance Optimizations
    Real-World Use Cases
    Summary

    (1) What is PySpark SQL?
    (2) Why Use PySpark SQL?
    (3) Setting It Up (Step-by-Step)

    Step 1: Create a SparkSession

        from pyspark.sql import SparkSession

        spark = SparkSession.builder \
            .appName("PySparkSQLDemo") \
            .getOrCreate()

    Step 2: Load Data into a DataFrame

        df = spark.read.csv("employees.csv", header=True, inferSchema=True)
        df.show()

    Step 3: Register DataFrame as SQL Table (Temp View)

        df.createOrReplaceTempView("employees")

    Step 4: Run SQL Queries!

        result = spark.sql(""" SELECT
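
    The excerpt breaks off inside the query string. A minimal sketch of what Step 4 could look like, continuing from the steps above; the "department" and "salary" columns and the grouping are hypothetical choices for the employees view.

        # Assumes the "employees" temp view registered in Step 3; column names are hypothetical.
        result = spark.sql("""
            SELECT department, AVG(salary) AS avg_salary
            FROM employees
            GROUP BY department
            ORDER BY avg_salary DESC
        """)
        result.show()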

    Read More

  • PySpark – DataFrames

    PySpark – DataFrames

    Table Of Contents:
    What Are PySpark DataFrames?
    Why Use DataFrames In PySpark?
    How To Create DataFrames In PySpark?
    Common DataFrame Operations
    Lazy Evaluation
    Under the Hood: Catalyst & Tungsten

    (1) What Are PySpark DataFrames?
    (2) Why Use DataFrames in PySpark?
    (3) How to Create a DataFrame?

    From A List:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("Example").getOrCreate()

        data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
        columns = ["Name", "Age"]
        df = spark.createDataFrame(data, columns)
        df.show()

    From A CSV File:

        df = spark.read.csv("employees.csv", header=True, inferSchema=True)
        df.show()

    (4) Common DataFrame Operations

    Filtering Rows:

        df.filter(df.Age > 30).show()

    Selecting Columns:

        df.select("Name").show()

    Group and
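
    The excerpt ends just as grouping is introduced. A minimal sketch of a typical group-and-aggregate on the df created above; grouping by "Name" is only for illustration, since with real data you would group by a categorical column.

        from pyspark.sql import functions as F

        # Row count and average age per group.
        df.groupBy("Name").agg(
            F.count(F.lit(1)).alias("n"),
            F.avg("Age").alias("avg_age")
        ).show()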

    Read More

  • PySpark – Spark Application Lifecycle Overview

    PySpark – Spark Application Lifecycle Overview

    Table Of Contents:
    Spark Application Starts
    Driver Program Is Launched
    Cluster Manager Allocates Resources
    Job is Created on Action
    DAG Scheduler Breaks Job into Stages
    Tasks are Sent to Executors
    Results Returned to Driver
    SparkContext Stops / Application Ends

    (1) Spark Application Start

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("MyApp").getOrCreate()

    We first initialize a Spark application to enable distributed data processing with Apache Spark; creating the SparkSession is the entry point for using Spark.

    (2) Driver Program Is Launched

        from pyspark.sql import SparkSession

        # This runs on the
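
    To make the "Job is Created on Action" step concrete, a minimal sketch: the transformations below are only recorded, and the action at the end is what triggers the driver to submit a job that the DAG scheduler splits into stages and tasks for the executors. The numbers and partition count are arbitrary.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("LifecycleDemo").getOrCreate()

        rdd = spark.sparkContext.parallelize(range(1, 1001), numSlices=4)  # 4 partitions -> 4 tasks
        doubled = rdd.map(lambda x: x * 2)            # transformation: nothing runs yet
        evens = doubled.filter(lambda x: x % 4 == 0)  # still lazy

        total = evens.sum()  # action: a job is submitted and the result comes back to the driver
        print(total)

        spark.stop()  # SparkContext stops / application ends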

    Read More

  • PySpark – Apache PySpark Ecosystem Overview

    PySpark – Apache PySpark Ecosystem Overview

    Table Of Contents:
    SparkContext
    RDD (Resilient Distributed Dataset)
    DataFrame
    Spark SQL
    SparkSession
    MLlib
    Spark Streaming / Structured Streaming
    GraphX / GraphFrames
    Data Sources & Integration
    Deployment & Cluster Management
    PySpark Libraries

    (1) SparkContext

        from pyspark import SparkContext

        sc = SparkContext("local", "MyApp")

    (2) RDD (Resilient Distributed Dataset)

        rdd = sc.parallelize([1, 2, 3, 4])
        rdd2 = rdd.map(lambda x: x * 2)

    (3) DataFrame

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("App").getOrCreate()
        df = spark.read.csv("data.csv", header=True)

    (4) Spark SQL

        df.createOrReplaceTempView("people")
        spark.sql("SELECT * FROM people WHERE age > 30").show()

    (5) SparkSession

        spark = SparkSession.builder.appName("App").getOrCreate()

    (6) MLlib

        from pyspark.ml.classification
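
    The excerpt stops at the MLlib import, before the later ecosystem components. As one more example, a minimal sketch of the "Data Sources & Integration" piece listed in the table of contents, reading and writing a couple of common formats; the file paths and the "date" partition column are placeholders.

        # Reuses the SparkSession created above; paths and columns are placeholders.
        df_json = spark.read.json("events.json")                    # JSON source
        df_parquet = spark.read.parquet("warehouse/users.parquet")  # Parquet source

        # Write a DataFrame back out in Parquet, partitioned by a (hypothetical) "date" column.
        df_json.write.mode("overwrite").partitionBy("date").parquet("output/events_parquet")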

    Read More

  • PySpark – PySpark Vs Pandas Vs Dask

    PySpark – PySpark Vs Pandas Vs Dask

    Table Of Contents:
    PySpark Vs Pandas Vs Dask
    Use Case-Based Comparison
    Summary

    (1) PySpark Vs Pandas Vs Dask
    (2) Use Case-Based Comparison
    (3) Summary
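
    The excerpt lists the comparison topics without code. A minimal sketch of the same group-by average written against each library, the kind of side-by-side the "Use Case-Based Comparison" section suggests; the sales.csv file and its "region" and "amount" columns are made up.

        # Pandas: eager, single machine, in memory
        import pandas as pd
        pdf = pd.read_csv("sales.csv")
        print(pdf.groupby("region")["amount"].mean())

        # Dask: lazy, parallel on one machine or a cluster; .compute() triggers execution
        import dask.dataframe as dd
        ddf = dd.read_csv("sales.csv")
        print(ddf.groupby("region")["amount"].mean().compute())

        # PySpark: lazy and distributed; the .show() action triggers execution
        from pyspark.sql import SparkSession, functions as F
        spark = SparkSession.builder.appName("Comparison").getOrCreate()
        sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)
        sdf.groupBy("region").agg(F.avg("amount").alias("avg_amount")).show()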

    Read More

  • PySpark – Why Use PySpark Over Python?

    PySpark – Why Use PySpark Over Python?

    Table Of Contents:
    Why Use PySpark Over Python?
    Distributed Computing
    Big Data Support
    Lazy Evaluation
    Built-In Fault Tolerance
    Support For SQL, ML, Streaming and Graphs
    Cluster Deployment
    Optimized Engine

    (1) Why Use PySpark Over Python?
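
    The excerpt lists the reasons without examples. A minimal sketch contrasting plain, eager, single-process Python with PySpark's lazy, distributed evaluation of the same computation; the numbers.txt file and the threshold are arbitrary.

        # Plain Python: the whole file is read into one process's memory, eagerly.
        with open("numbers.txt") as f:
            values = [int(line) for line in f]
        print(sum(v * 2 for v in values if v > 10))

        # PySpark: the same logic is recorded lazily and only runs, in parallel
        # across partitions, when the sum() action is called.
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("WhyPySpark").getOrCreate()

        rdd = spark.sparkContext.textFile("numbers.txt")
        result = (rdd.map(lambda s: int(s))
                     .filter(lambda v: v > 10)
                     .map(lambda v: v * 2)
                     .sum())  # action triggers the distributed job
        print(result)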

    Read More

  • PySpark – What Is PySpark?

    PySpark – What Is PySpark?

    Table Of Contents:
    What Is PySpark?
    What Is Distributed Computing?
    If I Have Only A Single Computer, How Will The Task Get Distributed?
    How Does Spark Work On A Single-Core Device?

    (1) What Is PySpark?
    (2) What Is Distributed Computing?
    (3) If I Have Only A Single Computer, How Will The Task Get Distributed?
    (4) How Does Spark Work On A Single-Core Device?
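
    Question (4) can be shown directly: with a local master, Spark runs the driver and its worker threads inside a single process on one machine, so the same code works there and on a cluster. A minimal sketch; the master string, data, and thread count are arbitrary.

        from pyspark.sql import SparkSession

        # local[2] runs Spark on this machine with 2 worker threads;
        # local[*] would use every available core. No cluster is needed.
        spark = (SparkSession.builder
                 .master("local[2]")
                 .appName("SingleMachineDemo")
                 .getOrCreate())

        rdd = spark.sparkContext.parallelize(range(10), numSlices=2)
        print(rdd.map(lambda x: x * x).collect())  # work is split across the 2 threads

        spark.stop()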

    Read More