admin – Page 84

April 20, 2025

PySpark – PySpark MLlib

Table Of Contents:
What is PySpark MLlib?
Two APIs in MLlib
Why Use MLlib?
Key Features
Example ML Pipeline (End-to-End)
Commonly Used Classes
When to Use PySpark MLlib?

(1) What Is PySpark MLlib?
(2) Two APIs in MLlib
(3) Why Use MLlib?
(4) Key Features
(5) What is a PySpark Pipeline?

    model.transform(data)
    model = pipeline.fit(data)

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # Step 1: Convert label to numeric
    indexer = StringIndexer(inputCol="purchased", outputCol="label")

    # Step 2: Assemble features
    assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")

    # Step 3: Model
    lr = LogisticRegression(featuresCol="features",
April 20, 2025

PySpark – PySpark SQL

Table Of Contents:
What is PySpark SQL?
Why Use PySpark SQL?
Setting It Up (Step-by-Step)
SQL vs DataFrame APIs (Both Supported!)
Advanced Features in PySpark SQL
Input Data Formats
Performance Optimizations
Real-World Use Cases
Summary

(1) What is PySpark SQL?
(2) Why Use PySpark SQL?
(3) Setting It Up (Step-by-Step)

Step 1: Create a SparkSession

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("PySparkSQLDemo")
        .getOrCreate()
    )

Step 2: Load Data into a DataFrame

    df = spark.read.csv("employees.csv", header=True, inferSchema=True)
    df.show()

Step 3: Register DataFrame as SQL Table (Temp View)

    df.createOrReplaceTempView("employees")

Step 4: Run SQL Queries!

    result = spark.sql(""" SELECT
April 20, 2025

PySpark – DataFrames

Table Of Contents:
What Are PySpark DataFrames?
Why Use DataFrames In PySpark?
How To Create DataFrames In PySpark?
Common DataFrame Operations
Lazy Evaluation
Under the Hood: Catalyst & Tungsten

(1) What Are PySpark DataFrames?
(2) Why Use DataFrames in PySpark?
(3) How to Create a DataFrame?

From A List:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Example").getOrCreate()
    data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
    columns = ["Name", "Age"]
    df = spark.createDataFrame(data, columns)
    df.show()

From A CSV File:

    df = spark.read.csv("employees.csv", header=True, inferSchema=True)
    df.show()

(4) Common DataFrame Operations

Filtering Rows:

    df.filter(df.Age > 30).show()

Selecting Columns:

    df.select("Name").show()

Group and
April 20, 2025

PySpark – Spark Application Lifecycle Overview

Table Of Contents:
Spark Application Starts
Driver Program Is Launched
Cluster Manager Allocates Resources
Job is Created on Action
DAG Scheduler Breaks Job into Stages
Tasks are Sent to Executors
Results Returned to Driver
SparkContext Stops / Application Ends

(1) Spark Application Starts

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("MyApp").getOrCreate()

Creating a SparkSession initializes the Spark application, which is what enables distributed data processing with Apache Spark. It is the entry point for using Spark.

(2) Driver Program Is Launched

    from pyspark.sql import SparkSession

    # This runs on the
April 20, 2025

PySpark – Apache PySpark Ecosystem Overview

Table Of Contents:
SparkContext
RDD (Resilient Distributed Dataset)
DataFrame
Spark SQL
SparkSession
MLlib
Spark Streaming / Structured Streaming
GraphX / GraphFrames
Data Sources & Integration
Deployment & Cluster Management
PySpark Libraries

(1) SparkContext

    from pyspark import SparkContext

    sc = SparkContext("local", "MyApp")

(2) RDD (Resilient Distributed Dataset)

    rdd = sc.parallelize([1, 2, 3, 4])
    rdd2 = rdd.map(lambda x: x * 2)

(3) DataFrame

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("App").getOrCreate()
    df = spark.read.csv("data.csv", header=True)

(4) Spark SQL

    df.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people WHERE age > 30").show()

(5) SparkSession

    spark = SparkSession.builder.appName("App").getOrCreate()

(6) MLlib

    from pyspark.ml.classification
April 20, 2025

PySpark – PySpark Vs Pandas Vs Dask

Table Of Contents:
PySpark Vs Pandas Vs Dask
Use Case-Based Comparison
Summary

(1) PySpark Vs Pandas Vs Dask
(2) Use Case-Based Comparison
(3) Summary
April 20, 2025

PySpark – Why Use PySpark Over Python?

Table Of Contents:
Why Use PySpark Over Python?
Distributed Computing
Big Data Support
Lazy Evaluation
Built-In Fault Tolerance
Support For SQL, ML, Streaming and Graphs
Cluster Deployment
Optimized Engine

(1) Why Use PySpark Over Python?
April 19, 2025

PySpark – What Is PySpark?

Table Of Contents:
What Is PySpark?
What Is Distributed Computing?
What Happens If I Have Only A Single Computer? How Will The Tasks Get Distributed?
How Does Spark Work On A Single-Core Device?

(1) What Is PySpark?
(2) What Is Distributed Computing?
(3) What Happens If I Have Only A Single Computer? How Will The Tasks Get Distributed?
(4) How Does Spark Work On A Single-Core Device?
April 19, 2025

PySpark – Syllabus

Table Of Contents:
April 18, 2025

NLP – BERT Architecture

Table Of Contents:
Introduction to BERT
BERT Architecture
Input Representation
Pretraining Objectives
Fine-Tuning BERT
Variants of BERT
BERT Evaluation and Benchmarks
Advanced Concepts
Implementation with Libraries
Limitations and Challenges
Applications of BERT

(1) Introduction To BERT
(2) BERT – Questions

What is BERT and the transformer, and why do I need to understand it? Models like BERT are already massively impacting academia and business, so we’ll outline some of the ways these models are used, and clarify some of the terminology around them.

What did we do before these models? To understand these models, it’s important to look