Spark Machine Learning
Georgetown University
Fall 2025
Review of Distributed Hardware and Spark RDDs
Spark ML
Project Introduction
Lab:
Quiz on Hadoop Streaming and HDFS (due Friday October 14)
Credit limit - $100
Course numbers:
STAY WITH COURSE 24178 UNLESS YOU HAVE RUN OUT OF CREDITS OR >$90 USED!
Note that you will have to repeat several setup steps:
You are responsible for your grades
If there is a problem, you need to bring it up
All assignments are mandatory
Status of assignments (from the syllabus):
Group Project: 30%
HW Assignments: 45%
Lab Deliverables: 15% - Halfway through
Quizzes: 10% - One to two more
What are the possible file system options for each numbered item below: S3, HDFS, or the local file system?
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar #1
-files basic-mapper.py,basic-reducer.py #2
-input /user/hadoop/in_data #3
-output /user/hadoop/out_data #3
-mapper basic-mapper.py #4
-reducer basic-reducer.py #4
By default, RDDs are recomputed every time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once.
Spark allows you to control what is cached in memory.
To tell Spark to cache an object in memory, use persist() or cache():
cache() is a shortcut for the default storage level, which is memory only.
persist() can be customized to other storage levels (including memory and/or disk).
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf
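A minimal sketch of both options (assumes an existing SparkSession named spark; the input path is hypothetical):
from pyspark import StorageLevel
rdd = spark.sparkContext.textFile("in_data.txt")   # hypothetical input file
rdd.cache()                                        # same as persist(StorageLevel.MEMORY_ONLY)
# rdd.persist(StorageLevel.MEMORY_AND_DISK)        # alternative: spill partitions to disk when memory is full
print(rdd.count())   # first action computes and caches the RDD
print(rdd.count())   # later actions reuse the cached partitions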
Caution with collect(): it brings the entire dataset back to the driver, which can exhaust driver memory on large data.
The Spark Application UI shows important facts about your Spark job:
Adapted from AWS Glue Spark UI docs and Spark UI docs
Adapted from UDFs in Spark
Problem: make a new column containing only the ages of adults (18+)
+-------+--------------+
|room_id| guests_ages|
+-------+--------------+
| 1| [18, 19, 17]|
| 2| [25, 27, 5]|
| 3|[34, 38, 8, 7]|
+-------+--------------+
from pyspark.sql.functions import udf, col
@udf("array<integer>")
def filter_adults(elements):
    return list(filter(lambda x: x >= 18, elements))
# alternatively
from pyspark.sql.types import IntegerType, ArrayType
@udf(returnType=ArrayType(IntegerType()))
def filter_adults(elements):
    return list(filter(lambda x: x >= 18, elements))
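To produce the adults_ages column shown below, the UDF can be applied with withColumn; a minimal sketch, assuming the rooms DataFrame above is named df:
df_with_adults = df.withColumn("adults_ages", filter_adults(col("guests_ages")))
df_with_adults.show()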
+-------+----------------+------------+
|room_id| guests_ages | adults_ages|
+-------+----------------+------------+
| 1 | [18, 19, 17] | [18, 19]|
| 2 | [25, 27, 5] | [25, 27]|
| 3 | [34, 38, 8, 7] | [34, 38]|
| 4 |[56, 49, 18, 17]|[56, 49, 18]|
+-------+----------------+------------+
Costs: Python UDFs run row by row in a Python worker, so data must be serialized between the JVM and Python, which adds overhead compared to built-in functions.
Other ways to make your Spark jobs faster (source):
From PySpark docs - Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data, which allows vectorized operations. A Pandas UDF is defined using the pandas_udf as a decorator or to wrap the function, and no additional configuration is required. A Pandas UDF behaves as a regular PySpark function API in general.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()
df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(to_upper("name")).show()
+--------------+
|to_upper(name)|
+--------------+
| JOHN DOE|
+--------------+
@pandas_udf("first string, last string")
def split_expand(s: pd.Series) -> pd.DataFrame:
    return s.str.split(expand=True)
df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(split_expand("name")).show()
+------------------+
|split_expand(name)|
+------------------+
| [John, Doe]|
+------------------+
https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.pandas_udf.html
Gather and collect data
Clean and inspect data
Perform feature engineering
Split data into train/test
Evaluate and compare models
MLlib
org.apache.spark.ml (high-level API)
- Provides a higher-level API built on top of DataFrames
- Allows the construction of ML pipelines
org.apache.spark.mllib (predates DataFrames)
- Original API built on top of RDDs
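A quick illustration of the two packages (a sketch; the RDD-based spark.mllib API is in maintenance mode and the DataFrame-based spark.ml API is recommended):
from pyspark.ml.classification import LogisticRegression               # DataFrame-based API (spark.ml)
from pyspark.mllib.classification import LogisticRegressionWithLBFGS  # RDD-based API (spark.mllib)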
The Pipelines API concept is mostly inspired by scikit-learn. Its main concepts are:
DataFrame
Transformer
Estimator
Pipeline
Parameter
Transformers take DataFrames as input and return a new DataFrame as output. Transformers do not learn any parameters from the data; they simply apply rule-based transformations, either to prepare data for model training or to generate predictions using a trained model.
Transformers are run using the .transform() method. Some commonly used transformers (a minimal example follows the list):
StringIndexer
OneHotEncoder (can act on multiple columns at a time)
Normalizer
StandardScaler
Tokenizer
StopWordsRemover
PCA
Bucketizer
QuantileDiscretizer
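A minimal transformer sketch using Tokenizer (assumes an existing SparkSession named spark; the sample sentence is made up):
from pyspark.ml.feature import Tokenizer
sentences = spark.createDataFrame([(0, "spark makes big data simple")], ["id", "text"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
# No fit() needed: Tokenizer applies a rule-based transformation only
tokenizer.transform(sentences).show(truncate=False)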
MLlib algorithms for machine learning models need a single, numeric features column as input. Each row in this column contains a vector of data points corresponding to the set of features used for prediction.
Use the VectorAssembler transformer to create a single vector column from a list of columns.
All categorical data needs to be numeric for machine learning.
# Example from Spark docs
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])
assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")
output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)
Estimators learn (or "fit") parameters from your DataFrame via the .fit() method, and return a model, which is a Transformer.
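A minimal estimator sketch using StandardScaler (assumes an existing SparkSession named spark; the toy data is made up):
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors
df_toy = spark.createDataFrame(
    [(Vectors.dense([10.0, 100.0]),), (Vectors.dense([20.0, 300.0]),)],
    ["features"])
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
scaler_model = scaler.fit(df_toy)     # the estimator learns column statistics from the data
scaler_model.transform(df_toy).show(truncate=False)  # the fitted model is a transformer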
Pipelines combine multiple steps into a single workflow that can be easily run:
* Data cleaning and feature processing via transformers, using stages
* Model definition
* Run the pipeline to do all transformations and fit the model
The Pipeline constructor takes an array of pipeline stages.
Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a Pipeline, you won’t need to manually keep track of your training and validation data at each step.
Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale, but Pipelines can help.
More Options for Model Validation: We can easily apply cross-validation and other techniques to our Pipelines.
Using the HMP Dataset. The structure of the dataset looks like this:
StringIndexer is an estimator, having both fit and transform methods. Here we use the class column as inputCol and classIndex as outputCol.
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol = 'class', outputCol = 'classIndex')
indexed = indexer.fit(df).transform(df) # This is a new data frame
# Let's see it
indexed.show(5)
+---+---+---+-----------+--------------------+----------+
| x| y| z| class| source|classIndex|
+---+---+---+-----------+--------------------+----------+
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|
| 28| 38| 52|Drink_glass|Accelerometer-201...| 2.0|
| 29| 37| 51|Drink_glass|Accelerometer-201...| 2.0|
| 30| 38| 52|Drink_glass|Accelerometer-201...| 2.0|
+---+---+---+-----------+--------------------+----------+
only showing top 5 rows
Unlike StringIndexer, OneHotEncoder was originally a pure transformer, having only the transform method; since Spark 3.0 it is an estimator and must be fit before transforming. It uses the same inputCol and outputCol syntax. OneHotEncoder creates a brand-new DataFrame (encoded), with a categoryVec column added to the previous DataFrame (indexed). OneHotEncoder doesn't return several columns containing only zeros and ones; it returns a sparse vector, as seen in the categoryVec column. Thus, for the 'Drink_glass' class above, Spark ML returns a sparse vector that basically says there are 13 elements, and at position 2 the class value exists (1.0).
from pyspark.ml.feature import OneHotEncoder
# In Spark 3.0+, OneHotEncoder is an estimator: fit() it, then transform()
encoder = OneHotEncoder(inputCol = 'classIndex', outputCol = 'categoryVec')
encoded = encoder.fit(indexed).transform(indexed) # This is a new data frame
encoded.show(5, False)
+---+---+---+-----------+----------------------------------------------------+----------+--------------+
|x |y |z |class |source |classIndex|categoryVec |
+---+---+---+-----------+----------------------------------------------------+----------+--------------+
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|
|28 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|
|29 |37 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|
|30 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|
+---+---+---+-----------+----------------------------------------------------+----------+--------------+
only showing top 5 rows
The VectorAssembler object gets initialized with the same syntax as StringIndexer and OneHotEncoder. ['x', 'y', 'z'] is passed to inputCols, and we specify outputCol = 'features'.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
# VectorAssembler creates vectors from ordinary data types for us
vectorAssembler = VectorAssembler(inputCols = ['x','y','z'], outputCol = 'features')
# Now we use the vectorAssembler object to transform our last updated dataframe
features_vectorized = vectorAssembler.transform(encoded) # note this is a new df
features_vectorized.show(5, False)
+---+---+---+-----------+----------------------------------------------------+----------+--------------+----------------+
|x |y |z |class |source |classIndex|categoryVec |features |
+---+---+---+-----------+----------------------------------------------------+----------+--------------+----------------+
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|[29.0,39.0,51.0]|
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|[29.0,39.0,51.0]|
|28 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|[28.0,38.0,52.0]|
|29 |37 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|[29.0,37.0,51.0]|
|30 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|[30.0,38.0,52.0]|
+---+---+---+-----------+----------------------------------------------------+----------+--------------+----------------+
only showing top 5 rows
Normalizer, like all transformers, has consistent syntax. Looking at the Normalizer object below, we pass the parameter p=1.0 (the default norm value for the PySpark Normalizer is p=2.0).
p=1.0 uses Manhattan distance
p=2.0 uses Euclidean distance
from pyspark.ml.feature import Normalizer
normalizer = Normalizer(inputCol = 'features', outputCol = 'features_norm', p=1.0) # Manhattan Distance
normalized_data = normalizer.transform(features_vectorized) # New data frame too.
normalized_data.show(5)
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
| x| y| z| class| source|classIndex| categoryVec| features| features_norm|
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 28| 38| 52|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[28.0,38.0,52.0]|[0.23728813559322...|
| 29| 37| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,37.0,51.0]|[0.24786324786324...|
| 30| 38| 52|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[30.0,38.0,52.0]|[0.25,0.316666666...|
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
only showing top 5 rows
Pipeline
The Pipeline
constructor below takes an array of Pipeline stages we pass to it. Here we pass the 4 stages defined earlier, in the right sequence, one after another.
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = [indexer,encoder,vectorAssembler,normalizer])
Now the Pipeline object is fit to our original DataFrame df:
data_model = pipeline.fit(df)
Finally, we transform our DataFrame df using the fitted pipeline model:
pipelined_data = data_model.transform(df)
pipelined_data.show(5)
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
| x| y| z| class| source|classIndex| categoryVec| features| features_norm|
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 28| 38| 52|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[28.0,38.0,52.0]|[0.23728813559322...|
| 29| 37| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,37.0,51.0]|[0.24786324786324...|
| 30| 38| 52|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[30.0,38.0,52.0]|[0.25,0.316666666...|
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
only showing top 5 rows
# first let's list out the columns we want to drop
cols_to_drop = ['x','y','z','class','source','classIndex','features']
# Next let's use a list comprehension with conditionals to select cols we need
selected_cols = [col for col in pipelined_data.columns if col not in cols_to_drop]
# Let's define a new train_df with only the categoryVec and features_norm cols
df_train = pipelined_data.select(selected_cols)
# Let's see our training dataframe.
df_train.show(5)
+--------------+--------------------+
| categoryVec| features_norm|
+--------------+--------------------+
|(13,[2],[1.0])|[0.24369747899159...|
|(13,[2],[1.0])|[0.24369747899159...|
|(13,[2],[1.0])|[0.23728813559322...|
|(13,[2],[1.0])|[0.24786324786324...|
|(13,[2],[1.0])|[0.25,0.316666666...|
+--------------+--------------------+
only showing top 5 rows
There are many Spark machine learning models
Linear regression - what is the optimization method?
Survival regression
Generalized linear model (GLM)
Random forest regression
Feature selection using easy R-like formulas
# Example from Spark docs
from pyspark.ml.feature import RFormula
dataset = spark.createDataFrame(
    [(7, "US", 18, 1.0),
     (8, "CA", 12, 0.0),
     (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])
formula = RFormula(
    formula="clicked ~ country + hour",
    featuresCol="features",
    labelCol="label")
output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
Evaluating 1,000s or 10,000s of variables?
Pick categorical variables that are most dependent on the response variable
Can even check the p-values of specific variables!
# Example from Spark docs
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")
result = selector.fit(df).transform(df)
print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())
result.show()
# Example from Spark docs
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
# Prepare training and test data.
data = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)
lr = LinearRegression(maxIter=10)
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01])\
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()
# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)
# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)
# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test)\
    .select("features", "label", "prediction")\
    .show()
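Cross-validation (mentioned earlier as a benefit of Pipelines) follows the same pattern; a minimal sketch reusing lr, paramGrid, and train from above, with an assumed 3 folds:
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=RegressionEvaluator(),
                    numFolds=3)   # each parameter combination is evaluated on 3 train/validation splits
cv_model = cv.fit(train)          # cv_model.bestModel is the best LinearRegression model found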