PySpark – PySpark MLlib

Table Of Contents:

  1. What is PySpark MLlib?
  2. Two APIs in MLlib
  3. Why Use MLlib?
  4. Key Features
  5. What is a PySpark Pipeline?
  6. Example ML Pipeline (End-to-End)
  7. Commonly Used Classes
  8. When to Use PySpark MLlib?

(1) What is PySpark MLlib?
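
MLlib is Apache Spark's scalable machine learning library. Through PySpark it provides distributed implementations of common algorithms for classification, regression, clustering, and recommendation, together with tools for feature engineering, pipelines, model evaluation, and persistence, all running on Spark DataFrames.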

(2) Two APIs in MLlib
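
MLlib ships two APIs: pyspark.mllib, the original RDD-based API, which has been in maintenance mode since Spark 2.0, and pyspark.ml, the DataFrame-based API built around Pipelines. New code should use pyspark.ml, as the examples below do. A minimal sketch of where the two APIs live:

# RDD-based API (maintenance mode; kept for legacy code)
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# DataFrame-based API (recommended for all new work)
from pyspark.ml.classification import LogisticRegression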

(3) Why Use MLlib?
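
The main reason to use MLlib is scale: training and prediction run in parallel across a cluster, so datasets too large for a single machine's memory remain workable. It also keeps the whole workflow in one framework, so data loaded and prepared with Spark SQL and DataFrames feeds straight into model training without moving data between systems.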

(4) Key Features
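
  - ML algorithms: classification, regression, clustering, and collaborative filtering
  - Featurization: feature extraction, transformation, and selection
  - Pipelines: tools for constructing, evaluating, and tuning ML workflows
  - Persistence: saving and loading algorithms, models, and pipelines
  - Utilities: linear algebra, statistics, and data handling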

(5) What is a PySpark Pipeline?
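
A Pipeline chains multiple stages (transformers and estimators) into a single workflow. Calling pipeline.fit(data) runs each stage in order and returns a fitted PipelineModel; calling model.transform(data) then applies every fitted stage to a DataFrame in one pass: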

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("PipelineExample").getOrCreate()

# Small example DataFrame (hypothetical values) with the columns used below
data = spark.createDataFrame(
    [(25, 30000.0, "no"), (35, 60000.0, "yes"), (45, 80000.0, "yes"), (29, 40000.0, "no")],
    ["age", "salary", "purchased"],
)

# Step 1: Convert label to numeric
indexer = StringIndexer(inputCol="purchased", outputCol="label")

# Step 2: Assemble features
assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")

# Step 3: Model
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Build the pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])

# Fit the pipeline
model = pipeline.fit(data)

# Make predictions
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
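
A fitted PipelineModel can also be saved and reloaded later (a minimal sketch; the path is hypothetical):

# Persist the fitted pipeline to disk
model.write().overwrite().save("models/purchase_lr")

# Reload it in another job or session
from pyspark.ml import PipelineModel
loaded = PipelineModel.load("models/purchase_lr")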

(6) Example ML Pipeline (End-to-End)

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("EndToEndPipeline").getOrCreate()

# Load and prepare data; data.csv is assumed to contain numeric
# feature1 and feature2 columns plus a binary numeric label column
data = spark.read.csv("data.csv", header=True, inferSchema=True)

# Feature engineering
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")

# Model
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Pipeline
pipeline = Pipeline(stages=[assembler, lr])

# Train the model
model = pipeline.fit(data)

# Make predictions
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()

(7) Commonly Used Classes
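
  - pyspark.ml.feature: StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
  - pyspark.ml.classification: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier
  - pyspark.ml.regression: LinearRegression, GBTRegressor
  - pyspark.ml.clustering: KMeans
  - pyspark.ml.recommendation: ALS
  - pyspark.ml.evaluation: BinaryClassificationEvaluator, MulticlassClassificationEvaluator
  - pyspark.ml.tuning: CrossValidator, ParamGridBuilder

For example, the evaluators score a predictions DataFrame like the ones produced above (a minimal sketch; it assumes the default rawPrediction and label column names from the pipeline example):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Area under the ROC curve for the binary classifier's predictions
evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc:.3f}")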

(8) When to Use PySpark MLlib?
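
Reach for MLlib when the training data is too large for a single machine, or when it already lives in Spark and you want feature preparation and model training in one distributed job. For datasets that fit comfortably in memory on one machine, a single-node library such as scikit-learn is usually simpler and faster.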
