SparkML
Georgetown University
Fall 2023
Credit limit - $100
Course numbers:
STAY WITH COURSE 54380 UNLESS YOU HAVE RUN OUT OF CREDITS OR >$90 USED!
Note that you will have to repeat several setup steps.
By default, RDDs are recomputed every time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once.
Spark allows you to control what is cached in memory.
To tell Spark to cache an object in memory, use persist() or cache():
cache(): a shortcut for the default storage level, which is memory only
persist(): can be customized to other ways to persist data (including memory and/or disk)
collect(): CAUTION: collect() gathers the entire dataset back to the driver, so use it only on small results.
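A minimal sketch (assuming an existing SparkSession named spark, as in the examples later in these notes) of cache() vs. persist() on an RDD:
from pyspark import StorageLevel

rdd = spark.sparkContext.parallelize(range(1_000_000))
rdd.cache()                                  # shortcut for persist(StorageLevel.MEMORY_ONLY)
rdd.count()                                  # the first action materializes the cache

rdd2 = spark.sparkContext.parallelize(range(1_000_000))
rdd2.persist(StorageLevel.MEMORY_AND_DISK)   # customized storage level: spill to disk if needed
rdd2.count()
rdd2.unpersist()                             # release cached data when no longer needed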
A typical machine learning workflow:
Gather and collect data
Clean and inspect data
Perform feature engineering
Split data into train/test
Evaluate and compare models
MLlib
org.apache.spark.ml (high-level API): provides a higher-level API built on top of DataFrames; allows the construction of ML pipelines
org.apache.spark.mllib (predates DataFrames; in maintenance mode): the original API built on top of RDDs
The spark.ml Pipeline API is inspired by scikit-learn, and is built around these main concepts:
DataFrame
Transformer
Estimator
Pipeline
Parameter
Transformers take a DataFrame as input and return a new DataFrame as output. Transformers do not learn any parameters from the data; they simply apply rule-based transformations to either prepare data for model training or generate predictions using a trained model.
Transformers are run using the .transform() method. Commonly used transformers include (a minimal Tokenizer sketch follows the list):
StringIndexer
OneHotEncoder (can act on multiple columns at a time)
Normalizer
StandardScaler
Tokenizer
StopWordsRemover
PCA
Bucketizer
QuantileDiscretizer
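For example, a minimal Tokenizer sketch (toy sentence, assuming the same spark session) showing the rule-based, no-fit pattern:
from pyspark.ml.feature import Tokenizer

sentences = spark.createDataFrame([(0, "Spark ML pipelines are handy")], ["id", "text"])
tokenizer = Tokenizer(inputCol = "text", outputCol = "words")
tokenizer.transform(sentences).show(truncate=False)   # rule-based split; nothing is learned from the data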
MLlib algorithms for machine learning models need a single, numeric features column as input. Each row in this column contains a vector of data points corresponding to the set of features used for prediction.
Use the VectorAssembler transformer to create a single vector column from a list of columns.
All categorical data needs to be numeric for machine learning.
# Example from Spark docs
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
dataset = spark.createDataFrame(
[(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
["id", "hour", "mobile", "userFeatures", "clicked"])
assembler = VectorAssembler(
inputCols=["hour", "mobile", "userFeatures"],
outputCol="features")
output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)
Estimators learn (or "fit") parameters from your DataFrame via the .fit() method, and return a model, which is a Transformer.
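A minimal sketch of the Estimator pattern (toy data, not from these notes), using LogisticRegression:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

train_toy = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0)],
    ["features", "label"])
lr = LogisticRegression(maxIter = 10)                       # an Estimator
lr_model = lr.fit(train_toy)                                # .fit() learns coefficients and returns a model
lr_model.transform(train_toy).select("prediction").show()   # the returned model is a Transformer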
Pipelines combine multiple steps into a single workflow that can be easily run:
* Data cleaning and feature processing via transformers, using stages
* Model definition
* Run the pipeline to do all transformations and fit the model
The Pipeline constructor takes an array of pipeline stages.
Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a Pipeline, you won’t need to manually keep track of your training and validation data at each step.
Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale, but Pipelines can help.
More Options for Model Validation: We can easily apply cross-validation and other techniques to our Pipelines.
Using the HMP dataset: each row holds accelerometer readings (x, y, z), an activity class label, and the source file of the reading.
StringIndexer is an estimator having both fit and transform methods. We use the class column as inputCol, and classIndex as outputCol.
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol = 'class', outputCol = 'classIndex')
indexed = indexer.fit(df).transform(df) # This is a new data frame
# Let's see it
indexed.show(5)
+---+---+---+-----------+--------------------+----------+
| x| y| z| class| source|classIndex|
+---+---+---+-----------+--------------------+----------+
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|
| 28| 38| 52|Drink_glass|Accelerometer-201...| 2.0|
| 29| 37| 51|Drink_glass|Accelerometer-201...| 2.0|
| 30| 38| 52|Drink_glass|Accelerometer-201...| 2.0|
+---+---+---+-----------+--------------------+----------+
only showing top 5 rows
Unlike StringIndexer, OneHotEncoder is a pure transformer, having only the transform method. It uses the same inputCol and outputCol syntax.
OneHotEncoder creates a brand new DataFrame (encoded), with a categoryVec column added to the previous DataFrame (indexed).
OneHotEncoder doesn't return several columns containing only zeros and ones; it returns a sparse vector, as seen in the categoryVec column. Thus, for the 'Drink_glass' class above, SparkML returns a sparse vector that says there are 13 elements, and at position 2 the class value exists (1.0). A short sketch after the output below decodes this notation.
(Note: in Spark 3.x, OneHotEncoder is an estimator that must be fit before transforming; the pure-transformer usage shown here applies to earlier Spark versions.)
from pyspark.ml.feature import OneHotEncoder
# The OneHotEncoder is used here as a pure transformer object; it does not use fit()
encoder = OneHotEncoder(inputCol = 'classIndex', outputCol = 'categoryVec')
encoded = encoder.transform(indexed) # This is a new data frame
encoded.show(5, False)
+---+---+---+-----------+----------------------------------------------------+----------+--------------+
|x |y |z |class |source |classIndex|categoryVec |
+---+---+---+-----------+----------------------------------------------------+----------+--------------+
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|
|28 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|
|29 |37 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|
|30 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|
+---+---+---+-----------+----------------------------------------------------+----------+--------------+
only showing top 5 rows
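To decode the (13,[2],[1.0]) notation, here is a small sketch (illustration only) building the same sparse vector directly:
from pyspark.ml.linalg import Vectors

sv = Vectors.sparse(13, [2], [1.0])   # size 13; index 2 holds the value 1.0
print(sv)                             # prints (13,[2],[1.0])
print(sv.toArray())                   # dense view: zeros everywhere except position 2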
The VectorAssembler object gets initialized with the same syntax as StringIndexer and OneHotEncoder. ['x', 'y', 'z'] is passed to inputCols, and we specify outputCol = 'features'.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
# VectorAssembler creates vectors from ordinary data types for us
vectorAssembler = VectorAssembler(inputCols = ['x','y','z'], outputCol = 'features')
# Now we use the vectorAssembler object to transform our last updated dataframe
features_vectorized = vectorAssembler.transform(encoded) # note this is a new df
features_vectorized.show(5, False)
+---+---+---+-----------+----------------------------------------------------+----------+--------------+----------------+
|x |y |z |class |source |classIndex|categoryVec |features |
+---+---+---+-----------+----------------------------------------------------+----------+--------------+----------------+
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|[29.0,39.0,51.0]|
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|[29.0,39.0,51.0]|
|28 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|[28.0,38.0,52.0]|
|29 |37 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|[29.0,37.0,51.0]|
|30 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0 |(13,[2],[1.0])|[30.0,38.0,52.0]|
+---+---+---+-----------+----------------------------------------------------+----------+--------------+----------------+
only showing top 5 rows
Normalizer, like all transformers, has a consistent syntax. Looking at the Normalizer object below, it contains the parameter p=1.0 (the default norm value for the PySpark Normalizer is p=2.0).
p=1.0 uses Manhattan distance
p=2.0 uses Euclidean distance
from pyspark.ml.feature import Normalizer
normalizer = Normalizer(inputCol = 'features', outputCol = 'features_norm', p=1.0) # Manhattan Distance
normalized_data = normalizer.transform(features_vectorized) # New data frame too.
normalized_data.show(5)
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
| x| y| z| class| source|classIndex| categoryVec| features| features_norm|
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 28| 38| 52|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[28.0,38.0,52.0]|[0.23728813559322...|
| 29| 37| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,37.0,51.0]|[0.24786324786324...|
| 30| 38| 52|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[30.0,38.0,52.0]|[0.25,0.316666666...|
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
only showing top 5 rows
Pipeline
The Pipeline constructor below takes an array of Pipeline stages we pass to it. Here we pass the 4 stages defined earlier, in the right sequence, one after another.
Now the Pipeline object is fit to our original DataFrame df.
Finally, we transform our DataFrame df using the Pipeline object.
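A sketch of the construction and fit step (the pipeline variable name is assumed; data_model matches the transform call below), using the indexer, encoder, vectorAssembler, and normalizer stages defined above:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages = [indexer, encoder, vectorAssembler, normalizer])
data_model = pipeline.fit(df)   # fitting returns a PipelineModel, which is a Transformer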
pipelined_data = data_model.transform(df)
pipelined_data.show(5)
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
| x| y| z| class| source|classIndex| categoryVec| features| features_norm|
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 28| 38| 52|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[28.0,38.0,52.0]|[0.23728813559322...|
| 29| 37| 51|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[29.0,37.0,51.0]|[0.24786324786324...|
| 30| 38| 52|Drink_glass|Accelerometer-201...| 2.0|(13,[2],[1.0])|[30.0,38.0,52.0]|[0.25,0.316666666...|
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
only showing top 5 rows
# first let's list out the columns we want to drop
cols_to_drop = ['x','y','z','class','source','classIndex','features']
# Next let's use a list comprehension with conditionals to select cols we need
selected_cols = [col for col in pipelined_data.columns if col not in cols_to_drop]
# Let's define a new train_df with only the categoryVec and features_norm cols
df_train = pipelined_data.select(selected_cols)
# Let's see our training dataframe.
df_train.show(5)
+--------------+--------------------+
| categoryVec| features_norm|
+--------------+--------------------+
|(13,[2],[1.0])|[0.24369747899159...|
|(13,[2],[1.0])|[0.24369747899159...|
|(13,[2],[1.0])|[0.23728813559322...|
|(13,[2],[1.0])|[0.24786324786324...|
|(13,[2],[1.0])|[0.25,0.316666666...|
+--------------+--------------------+
only showing top 5 rows
There are many Spark machine learning models:
Linear regression - what is the optimization method?
Survival regression
Generalized linear model (GLM)
Random forest regression
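A minimal sketch (toy data, not from these notes) of fitting one of these models, here a random forest regressor:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.linalg import Vectors

toy = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.0]), 3.0),
     (Vectors.dense([0.0, 2.0]), 5.0),
     (Vectors.dense([1.0, 2.0]), 7.0)],
    ["features", "label"])
rf = RandomForestRegressor(numTrees = 10)   # uses the default features/label column names
rf_model = rf.fit(toy)
rf_model.transform(toy).select("features", "prediction").show()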
Feature selection using easy R-like formulas
# Example from Spark docs
from pyspark.ml.feature import RFormula
dataset = spark.createDataFrame(
[(7, "US", 18, 1.0),
(8, "CA", 12, 0.0),
(9, "NZ", 15, 0.0)],
["id", "country", "hour", "clicked"])
formula = RFormula(
formula="clicked ~ country + hour",
featuresCol="features",
labelCol="label")
output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
Evaluating 1,000s or 10,000s of variables?
Pick the categorical variables that are most dependent on the response variable.
Can even check the p-values of specific variables (see the ChiSquareTest sketch after the example below)!
# Example from Spark docs
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
(8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
(9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
outputCol="selectedFeatures", labelCol="clicked")
result = selector.fit(df).transform(df)
print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())
result.show()
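The p-values themselves come from ChiSquareTest rather than ChiSqSelector; a sketch reusing the df defined above:
from pyspark.ml.stat import ChiSquareTest

r = ChiSquareTest.test(df, "features", "clicked").head()
print("pValues: " + str(r.pValues))
print("degreesOfFreedom: " + str(r.degreesOfFreedom))
print("statistics: " + str(r.statistics))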
# Example from Spark docs
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
# Prepare training and test data.
data = spark.read.format("libsvm")\
.load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)
lr = LinearRegression(maxIter=10)
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder()\
.addGrid(lr.regParam, [0.1, 0.01]) \
.addGrid(lr.fitIntercept, [False, True])\
.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
.build()
# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
estimatorParamMaps=paramGrid,
evaluator=RegressionEvaluator(),
# 80% of the data will be used for training, 20% for validation.
trainRatio=0.8)
# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)
# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test)\
.select("features", "label", "prediction")\
.show()
Review project overview https://gu-dsan.github.io/6000-fall-2023/project/project.html
Assignment link - https://georgetown.instructure.com/courses/172712/assignments/958317
DSAN 6000 | Fall 2023 | https://gu-dsan.github.io/6000-fall-2023/