PySpark – PySpark MLlib
Table Of Contents:
- What is PySpark MLlib?
- Two APIs in MLlib
- Why Use MLlib?
- Key Features
- What is a PySpark Pipeline?
- Example ML Pipeline (End-to-End)
- Commonly Used Classes
- When to Use PySpark MLlib?
(1) What is PySpark MLlib?
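MLlib is Apache Spark's scalable machine learning library. PySpark MLlib exposes it to Python, so you can train and apply machine learning models on distributed data without leaving Spark.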
(2) Two APIs in MLlib
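Spark ships two machine learning packages. The older pyspark.mllib API is built on RDDs and has been in maintenance mode since Spark 2.0; the newer pyspark.ml API is built on DataFrames and is the recommended one for new code (it is the API used throughout this article). As a rough sketch of the difference, the imports below contrast the two packages:

# DataFrame-based API (recommended): pyspark.ml
from pyspark.ml.classification import LogisticRegression

# RDD-based API (legacy, in maintenance mode): pyspark.mllib
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint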
(3) Why Use MLlib?
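Because it runs on Spark, MLlib can train models on datasets that do not fit on a single machine. It reuses the same DataFrames you already build for ETL, so data preparation, training, and prediction all stay inside one framework instead of being exported to a separate ML tool.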
(4) Key Features
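- Algorithms for classification, regression, clustering, and collaborative filtering
- Feature engineering tools (transformers such as StringIndexer, VectorAssembler, Tokenizer)
- Pipelines for chaining preprocessing and modeling steps
- Model persistence (saving and loading fitted models and pipelines)
- Utilities for model selection, such as CrossValidator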
(5) What is a PySpark Pipeline?
A Pipeline chains multiple stages (feature transformers plus an estimator) into a single workflow. Calling pipeline.fit(data) trains all stages in order and returns a fitted PipelineModel; calling model.transform(data) then runs the fitted stages to produce predictions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("PipelineExample").getOrCreate()

# Sample data (hypothetical values; any DataFrame with these columns works)
data = spark.createDataFrame(
    [(25, 30000.0, "no"), (35, 60000.0, "yes"), (45, 80000.0, "yes"), (50, 90000.0, "yes")],
    ["age", "salary", "purchased"],
)

# Step 1: Convert the string label to a numeric column
indexer = StringIndexer(inputCol="purchased", outputCol="label")

# Step 2: Assemble the feature columns into a single vector
assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")

# Step 3: Model
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Build the pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])

# Fit the pipeline (trains every stage in order)
model = pipeline.fit(data)

# Make predictions
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
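Once fitted, the PipelineModel can be persisted and reloaded later. A minimal sketch, assuming a writable path such as /tmp/purchase_model (the path is an example):

from pyspark.ml import PipelineModel

# Save the fitted pipeline (example path)
model.write().overwrite().save("/tmp/purchase_model")

# Reload it later for scoring
loaded = PipelineModel.load("/tmp/purchase_model")
loaded.transform(data).select("prediction").show()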
(6) Example ML Pipeline (End-to-End)
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("EndToEndPipeline").getOrCreate()

# Load and prepare data (data.csv is assumed to contain
# numeric columns feature1, feature2 and a numeric label column)
data = spark.read.csv("data.csv", header=True, inferSchema=True)

# Feature engineering: combine the input columns into one vector
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")

# Model
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Pipeline
pipeline = Pipeline(stages=[assembler, lr])

# Train the model
model = pipeline.fit(data)

# Make predictions
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
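In practice you would hold out a test set and score the model there rather than predicting on the training data. A minimal sketch using a random split and a binary-classification evaluator (this assumes the label column is binary):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Hold out 20% of the rows as a test set
train, test = data.randomSplit([0.8, 0.2], seed=42)

model = pipeline.fit(train)
predictions = model.transform(test)

# areaUnderROC is the evaluator's default metric
evaluator = BinaryClassificationEvaluator(labelCol="label")
print(evaluator.evaluate(predictions))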
(7) Commonly Used Classes
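Some classes you will reach for most often in pyspark.ml:
- Feature engineering: StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler, Tokenizer
- Classification: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier
- Regression: LinearRegression, GBTRegressor
- Clustering: KMeans
- Recommendation: ALS
- Evaluation and tuning: BinaryClassificationEvaluator, MulticlassClassificationEvaluator, CrossValidator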
(8) When to Use PySpark MLlib?
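Use PySpark MLlib when your data is too large to process on a single machine, when it already lives in Spark (for example, in a DataFrame pipeline feeding a data lake), or when you want data preparation, training, and batch scoring in one distributed framework. For small datasets that fit in memory, a single-machine library such as scikit-learn is usually simpler and faster.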

