Databricks and Spark NLP
Georgetown University
Fall 2025
Project Introduction
Spark NLP
Lab:
Credit limit - $100
Course numbers:
STAY WITH COURSE 24178 UNLESS YOU HAVE RUN OUT OF CREDITS OR HAVE USED >$90!
Note that you will have to repeat several setup steps:
What are the possible file system options for each item: S3, HDFS, Local file system
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar #1
-files basic-mapper.py,basic-reducer.py #2
-input /user/hadoop/in_data #3
-output /user/hadoop/out_data #3
-mapper basic-mapper.py #4
-reducer basic-reducer.py #4
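The contents of basic-mapper.py and basic-reducer.py aren't shown here; below is a minimal word-count-style sketch (an assumption, not the actual scripts) of what such Hadoop streaming programs typically look like.

#!/usr/bin/env python
# basic-mapper.py (hypothetical sketch): emit one "word<TAB>1" pair per token on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("{}\t1".format(word))

#!/usr/bin/env python
# basic-reducer.py (hypothetical sketch): sum the counts for each word;
# Hadoop streaming delivers the mapper output sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("{}\t{}".format(current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("{}\t{}".format(current_word, current_count))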
-files places the listed files (the mapper and reducer scripts) in the pwd of the distributed containers.
By default, RDDs are recomputed every time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once.
Spark allows you to control what is cached in memory.
To tell Spark to cache an object in memory, use persist() or cache():
cache() is a shortcut for using the default storage level, which is memory only.
persist() can be customized to other ways to persist data (including both memory and/or disk).
collect() CAUTION
Costs:
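A minimal sketch of caching, plus a reminder of why collect() deserves caution (paths and names are illustrative; sc and spark are the Databricks-provided contexts):

from pyspark import StorageLevel

rdd = sc.textFile("s3://my-bucket/in_data")      # illustrative path
rdd.cache()                                      # shortcut for persist(StorageLevel.MEMORY_ONLY)
rdd.count()                                      # first action materializes the cache
rdd.count()                                      # reuses the cached partitions

df = spark.read.parquet("s3://my-bucket/table")  # DataFrames can be persisted too
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()                                   # release the memory when done

# collect() pulls the ENTIRE dataset onto the driver -- prefer take(n) or
# writing results out when the data is large
# rows = df.collect()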
Other ways to make your Spark jobs faster (source):
Transformers take DataFrames as input and return a new DataFrame as output. Transformers do not learn any parameters from the data; they simply apply rule-based transformations to either prepare data for model training or generate predictions using a trained model. Transformers are run using the .transform() method.
Estimators learn (or "fit") parameters from your DataFrame via the .fit() method, and return a model, which is a Transformer.
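A small sketch of the difference, using built-in Spark ML classes (the DataFrame here is illustrative):

from pyspark.ml.feature import Tokenizer, CountVectorizer

df = spark.createDataFrame([("spark is fast",), ("spark is scalable",)], ["text"])

# Transformer: rule-based, nothing to learn
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokens = tokenizer.transform(df)

# Estimator: .fit() learns a vocabulary from the data and returns a Transformer (the model)
cv = CountVectorizer(inputCol="words", outputCol="features")
cv_model = cv.fit(tokens)                 # CountVectorizerModel is a Transformer
cv_model.transform(tokens).show(truncate=False)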
Pipelines combine multiple steps into a single workflow that can be easily run:
* Data cleaning and feature processing via transformers, using stages
* Model definition
* Run the pipeline to do all transformations and fit the model
The Pipeline constructor takes an array of pipeline stages.
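A minimal sketch of a Pipeline (train_df and test_df are assumed to exist with "text" and "label" columns):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
vectorizer = CountVectorizer(inputCol="words", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, vectorizer, lr])  # array of stages
model = pipeline.fit(train_df)        # fits every Estimator stage in order
predictions = model.transform(test_df)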
Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a Pipeline, you won’t need to manually keep track of your training and validation data at each step.
Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale, but Pipelines can help.
More Options for Model Validation: We can easily apply cross-validation and other techniques to our Pipelines.
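For example, a whole Pipeline can be handed to CrossValidator (a hedged sketch reusing pipeline, lr, and train_df from the sketch above):

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
cv_model = cv.fit(train_df)   # the whole Pipeline is refit on each fold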
Deadline for Project Milestone 1: EDA
Monday November 7
String manipulation SQL functions: F.length(col), F.substring(str, pos, len), F.trim(col), F.upper(col), …
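A quick illustration (the DataFrame and column names are made up for the example):

from pyspark.sql import functions as F

df = spark.createDataFrame([("  Spark NLP  ",)], ["text"])
df.select(
    F.length(F.col("text")).alias("len"),
    F.trim(F.col("text")).alias("trimmed"),
    F.upper(F.col("text")).alias("upper"),
    F.substring(F.col("text"), 3, 5).alias("sub"),
).show()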
ML transformers: Tokenizer(), StopWordsRemover(), Word2Vec(), CountVectorizer()
>>> df = spark.createDataFrame([("a b c",)], ["text"])
>>> tokenizer = Tokenizer(outputCol="words")
>>> tokenizer.setInputCol("text")
>>> tokenizer.transform(df).head()
Row(text='a b c', words=['a', 'b', 'c'])
>>> # Change a parameter.
>>> tokenizer.setParams(outputCol="tokens").transform(df).head()
Row(text='a b c', tokens=['a', 'b', 'c'])
>>> # Temporarily modify a parameter.
>>> tokenizer.transform(df, {tokenizer.outputCol: "words"}).head()
Row(text='a b c', words=['a', 'b', 'c'])
>>> tokenizer.transform(df).head()
Row(text='a b c', tokens=['a', 'b', 'c'])
>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["text"])
>>> remover = StopWordsRemover(stopWords=["b"])
>>> remover.setInputCol("text")
>>> remover.setOutputCol("words")
>>> remover.transform(df).head().words == ['a', 'c']
True
>>> stopWordsRemoverPath = temp_path + "/stopwords-remover"
>>> remover.save(stopWordsRemoverPath)
>>> loadedRemover = StopWordsRemover.load(stopWordsRemoverPath)
>>> loadedRemover.getStopWords() == remover.getStopWords()
True
>>> loadedRemover.getCaseSensitive() == remover.getCaseSensitive()
True
>>> loadedRemover.transform(df).take(1) == remover.transform(df).take(1)
True
>>> df2 = spark.createDataFrame([(["a", "b", "c"], ["a", "b"])], ["text1", "text2"])
>>> remover2 = StopWordsRemover(stopWords=["b"])
>>> remover2.setInputCols(["text1", "text2"]).setOutputCols(["words1", "words2"])
>>> remover2.transform(df2).show()
+---------+------+------+------+
| text1| text2|words1|words2|
+---------+------+------+------+
|[a, b, c]|[a, b]|[a, c]| [a]|
+---------+------+------+------+
Example: find the A's within a certain distance of a Y
# within2 -> 0
X X X X Y X X X A
# within2 -> 1
X X A X Y X X X A
# within2 -> 2
X A X A Y A X X A
# within4 -> 3
A X A X Y X X X A
Y's in the text that have an A near enough to them
Tokenizer() and StopWordsRemover()
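One way to do this by hand is a UDF over the token array (a hedged sketch; the "words" column and the window size of 2 are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

@F.udf(returnType=IntegerType())
def count_a_within_2(tokens):
    # count the A's that fall within 2 positions of a Y
    y_positions = [i for i, t in enumerate(tokens) if t == "Y"]
    return sum(1 for i, t in enumerate(tokens)
               if t == "A" and any(abs(i - y) <= 2 for y in y_positions))

# "words" is a token array column, e.g. produced by Tokenizer() / StopWordsRemover()
# df.withColumn("within2", count_a_within_2("words")).show()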
Why use UDFs to run proximity-based sentiment by hand? Let's use more advanced natural language processing packages!
Which libraries have the most features?
Just because it is scalable does not mean it lacks features!
Why??
Aaaaannndddd…. it scales!
Reusing the Spark ML Pipeline
Unified NLP & ML pipelines
End-to-end execution planning
Serializable
Distributable
Reusing NLP Functionality
TF-IDF calculation
String distance calculation
Stop word removal
Topic modeling
Distributed ML algorithms
fit() method to make an Annotator Model or Transformer
transform() method only
pretrained() method
No!! They only add columns.
PretrainedPipeline.transform()
PretrainedPipeline.annotate()
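A short sketch of the two ways to run a pretrained pipeline (assumes the published "explain_document_dl" pipeline and internet access to download it):

from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("explain_document_dl", lang="en")

# annotate(): convenient for a single string; returns a dict of result lists
annotations = pipeline.annotate("Spark NLP makes this easy.")

# transform(): runs on a DataFrame and adds annotation columns (it does not drop any)
df = spark.createDataFrame([["Spark NLP makes this easy."]]).toDF("text")
result = pipeline.transform(df)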
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, Normalizer, WordEmbeddingsModel
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("token")

normalizer = Normalizer()\
    .setInputCols(["token"])\
    .setOutputCol("normal")

word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "normal"])\
    .setOutputCol("embeddings")
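These stages can then be assembled with the Pipeline constructor and run end to end (a sketch; data is assumed to be a DataFrame with a "text" column):

nlp_pipeline = Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    normalizer,
    word_embeddings,
])
nlp_model = nlp_pipeline.fit(data)       # pretrained stages pass through fit() unchanged
processed = nlp_model.transform(data)    # adds document/sentences/token/normal/embeddings columns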
import sparknlp; from sparknlp.base import *
from sparknlp.annotator import *; from pyspark.ml import Pipeline
data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
result = documentAssembler.transform(data)
result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
result.select("document").printSchema()
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer
from sparknlp.annotator import *
from sparknlp.base import *
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)
!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}
sequenceClassifier = DistilBertForSequenceClassification.loadSavedModel(
    '{}/saved_model/1'.format(MODEL_NAME), spark)\
    .setInputCols(["document", "token"])\
    .setOutputCol("class")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(128)
#### WHERE IS THIS SAVING TO???
sequenceClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))
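The saved model can then be loaded back from that local path and dropped into a Spark NLP pipeline like any other annotator (a hedged sketch; the upstream documentAssembler and tokenizer stages are assumed to exist):

loaded_classifier = DistilBertForSequenceClassification.load(
    "./{}_spark_nlp".format(MODEL_NAME))

# e.g. Pipeline(stages=[documentAssembler, tokenizer, loaded_classifier])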
Required:
Encouraged: