-

PySpark – PySpark MLlib
Table Of Contents: What Is PySpark MLlib?; Two APIs in MLlib; Why Use MLlib?; Key Features; Example ML Pipeline (End-to-End); Commonly Used Classes; When to Use PySpark MLlib?
(1) What Is PySpark MLlib?
(2) Two APIs in MLlib
(3) Why Use MLlib?
(4) Key Features
(5) What Is a PySpark Pipeline?
    model = pipeline.fit(data)
    model.transform(data)
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    # Step 1: Convert label to numeric
    indexer = StringIndexer(inputCol="purchased", outputCol="label")
    # Step 2: Assemble features
    assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")
    # Step 3: Model
    lr = LogisticRegression(featuresCol="features",
-

PySpark – PySpark SQL
Table Of Contents: What Is PySpark SQL?; Why Use PySpark SQL?; Setting It Up (Step-by-Step); SQL vs DataFrame APIs (Both Supported!); Advanced Features in PySpark SQL; Input Data Formats; Performance Optimizations; Real-World Use Cases; Summary
(1) What Is PySpark SQL?
(2) Why Use PySpark SQL?
(3) Setting It Up (Step-by-Step)
Step 1: Create a SparkSession
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("PySparkSQLDemo").getOrCreate()
Step 2: Load Data into a DataFrame
    df = spark.read.csv("employees.csv", header=True, inferSchema=True)
    df.show()
Step 3: Register DataFrame as SQL Table (Temp View)
    df.createOrReplaceTempView("employees")
Step 4: Run SQL Queries!
    result = spark.sql("""
        SELECT
-

PySpark – DataFrames
Table Of Contents: What Are PySpark DataFrames?; Why Use DataFrames in PySpark?; How to Create DataFrames in PySpark?; Common DataFrame Operations; Lazy Evaluation; Under the Hood: Catalyst & Tungsten
(1) What Are PySpark DataFrames?
(2) Why Use DataFrames in PySpark?
(3) How to Create a DataFrame?
From a List:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("Example").getOrCreate()
    data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
    columns = ["Name", "Age"]
    df = spark.createDataFrame(data, columns)
    df.show()
From a CSV File:
    df = spark.read.csv("employees.csv", header=True, inferSchema=True)
    df.show()
(4) Common DataFrame Operations
Filtering Rows:
    df.filter(df.Age > 30).show()
Selecting Columns:
    df.select("Name").show()
Group and
-

PySpark – Spark Application Lifecycle Overview
Table Of Contents: Spark Application Starts; Driver Program Is Launched; Cluster Manager Allocates Resources; Job Is Created on Action; DAG Scheduler Breaks Job into Stages; Tasks Are Sent to Executors; Results Returned to Driver; SparkContext Stops / Application Ends
(1) Spark Application Starts
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("MyApp").getOrCreate()
We first need to initialize a Spark application to enable distributed data processing with Apache Spark. Creating the SparkSession initializes the application; it is the entry point for using Spark.
(2) Driver Program Is Launched
    from pyspark.sql import SparkSession
    # This runs on the
-

PySpark – Apache PySpark Ecosystem Overview
Table Of Contents: SparkContext; RDD (Resilient Distributed Dataset); DataFrame; Spark SQL; SparkSession; MLlib; Spark Streaming / Structured Streaming; GraphX / GraphFrames; Data Sources & Integration; Deployment & Cluster Management; PySpark Libraries
(1) SparkContext
    from pyspark import SparkContext
    sc = SparkContext("local", "MyApp")
(2) RDD (Resilient Distributed Dataset)
    rdd = sc.parallelize([1, 2, 3, 4])
    rdd2 = rdd.map(lambda x: x * 2)
(3) DataFrame
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("App").getOrCreate()
    df = spark.read.csv("data.csv", header=True)
(4) Spark SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people WHERE age > 30").show()
(5) SparkSession
    spark = SparkSession.builder.appName("App").getOrCreate()
(6) MLlib
    from pyspark.ml.classification
-

PySpark – PySpark Vs Pandas Vs Dask
Table Of Contents: PySpark Vs Pandas Vs Dask; Use Case-Based Comparison; Summary
(1) PySpark Vs Pandas Vs Dask
(2) Use Case-Based Comparison
(3) Summary
-

PySpark – Why Use PySpark Over Python?
Table Of Contents: Why Use PySpark Over Python?; Distributed Computing; Big Data Support; Lazy Evaluation; Built-In Fault Tolerance; Support for SQL, ML, Streaming, and Graphs; Cluster Deployment; Optimized Engine
(1) Why Use PySpark Over Python?
-

PySpark – What Is PySpark?
Table Of Contents: What Is PySpark?; What Is Distributed Computing?; If I Have Only a Single Computer, How Will the Task Get Distributed?; How Does Spark Work on a Single-Core Device?
(1) What Is PySpark?
(2) What Is Distributed Computing?
(3) If I Have Only a Single Computer, How Will the Task Get Distributed?
(4) How Does Spark Work on a Single-Core Device?
-

PySpark – Syllabus
Table Of Contents:
-

NLP – BERT Architecture
Table Of Contents: Introduction to BERT; BERT Architecture; Input Representation; Pretraining Objectives; Fine-Tuning BERT; Variants of BERT; BERT Evaluation and Benchmarks; Advanced Concepts; Implementation with Libraries; Limitations and Challenges; Applications of BERT
(1) Introduction to BERT
(2) BERT – Questions
What is BERT and the transformer, and why do I need to understand it? Models like BERT are already massively impacting academia and business, so we’ll outline some of the ways these models are used, and clarify some of the terminology around them.
What did we do before these models? To understand these models, it’s important to look
