SparkNLP and Project Setup
Georgetown University
Fall 2023
Credit limit - $100
Course numbers:
STAY WITH COURSE 54380 UNLESS YOU HAVE RUN OUT OF CREDITS OR >$90 USED!
Note that you will have to repeat several setup steps:
By default, RDDs are recomputed every time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once.
Spark allows you to control what is cached in memory.
To tell Spark to cache an object in memory, use persist() or cache():
cache(): a shortcut for the default storage level, which is memory only
persist(): can be customized to other ways of persisting data (including both memory and/or disk)
CAUTION: collect() has costs; it pulls the entire dataset back to the driver.
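A minimal sketch of the difference between the two, assuming a running SparkSession named spark:

from pyspark import StorageLevel
rdd = spark.sparkContext.parallelize(range(1000000)).map(lambda x: x * x)
rdd.cache()                                  # same as persist(StorageLevel.MEMORY_ONLY) for RDDs
rdd.count()                                  # first action computes and caches the partitions
rdd.sum()                                    # reuses the cached partitions instead of recomputing
rdd.unpersist()                              # release the memory when done
rdd.persist(StorageLevel.MEMORY_AND_DISK)    # alternative: spill to disk if it does not fit in memory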
Other ways to make your Spark jobs faster (source):
Transformers take DataFrames as input and return a new DataFrame as output. Transformers do not learn any parameters from the data; they simply apply rule-based transformations, either to prepare data for model training or to generate predictions using a trained model. Transformers are run using the .transform() method.
Estimators learn (or “fit”) parameters from your DataFrame via the .fit() method and return a model, which is itself a Transformer.
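A quick sketch of the fit/transform pattern, using CountVectorizer as the Estimator (column names here are made up for illustration):

from pyspark.ml.feature import CountVectorizer
df = spark.createDataFrame([(["spark", "is", "fast"],), (["spark", "scales"],)], ["words"])
cv = CountVectorizer(inputCol="words", outputCol="features")  # Estimator: nothing learned yet
cv_model = cv.fit(df)                                         # learns the vocabulary, returns a Transformer
cv_model.transform(df).show(truncate=False)                   # appends the features column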
String manipulation SQL functions: F.length(col), F.substring(str, pos, len), F.trim(col), F.upper(col), …
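For example, a small sketch applying these column functions to a made-up review column:

from pyspark.sql import functions as F
reviews = spark.createDataFrame([("  Great product  ",), ("terrible",)], ["review"])
reviews.select(
    F.trim(F.col("review")).alias("clean"),             # strip surrounding whitespace
    F.length(F.col("review")).alias("n_chars"),         # character count of the raw string
    F.upper(F.col("review")).alias("upper"),            # upper-case the text
    F.substring(F.col("review"), 1, 5).alias("first5")  # first five characters (1-indexed)
).show()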
ML transformers: Tokenizer(), StopWordsRemover(), Word2Vec(), CountVectorizer()
>>> from pyspark.ml.feature import Tokenizer
>>> df = spark.createDataFrame([("a b c",)], ["text"])
>>> tokenizer = Tokenizer(outputCol="words")
>>> tokenizer.setInputCol("text")
>>> tokenizer.transform(df).head()
Row(text='a b c', words=['a', 'b', 'c'])
>>> # Change a parameter.
>>> tokenizer.setParams(outputCol="tokens").transform(df).head()
Row(text='a b c', tokens=['a', 'b', 'c'])
>>> # Temporarily modify a parameter.
>>> tokenizer.transform(df, {tokenizer.outputCol: "words"}).head()
Row(text='a b c', words=['a', 'b', 'c'])
>>> tokenizer.transform(df).head()
Row(text='a b c', tokens=['a', 'b', 'c'])
>>> from pyspark.ml.feature import StopWordsRemover
>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["text"])
>>> remover = StopWordsRemover(stopWords=["b"])
>>> remover.setInputCol("text")
>>> remover.setOutputCol("words")
>>> remover.transform(df).head().words == ['a', 'c']
True
>>> # temp_path is any writable directory used to save the stage
>>> stopWordsRemoverPath = temp_path + "/stopwords-remover"
>>> remover.save(stopWordsRemoverPath)
>>> loadedRemover = StopWordsRemover.load(stopWordsRemoverPath)
>>> loadedRemover.getStopWords() == remover.getStopWords()
True
>>> loadedRemover.getCaseSensitive() == remover.getCaseSensitive()
True
>>> loadedRemover.transform(df).take(1) == remover.transform(df).take(1)
True
>>> df2 = spark.createDataFrame([(["a", "b", "c"], ["a", "b"])], ["text1", "text2"])
>>> remover2 = StopWordsRemover(stopWords=["b"])
>>> remover2.setInputCols(["text1", "text2"]).setOutputCols(["words1", "words2"])
>>> remover2.transform(df2).show()
+---------+------+------+------+
| text1| text2|words1|words2|
+---------+------+------+------+
|[a, b, c]|[a, b]|[a, c]| [a]|
+---------+------+------+------+
Example: find the A’s within a certain distance of a Y
# within2 -> 0
X X X X Y X X X A
# within2 -> 1
X X A X Y X X X A
# within2 -> 2
X A X A Y A X X A
# within4 -> 3
A X A X Y X X X A
Find the Y’s in the text that have an A near enough to them.
Tokenizer() and StopWordsRemover() prepare the tokens; the proximity count itself can be written as a UDF (sketch below).
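A minimal sketch of that proximity count as a UDF, assuming the text has already been split into tokens (the window of 2 matches the within2 examples above):

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

@F.udf(returnType=IntegerType())
def within2(tokens):
    # count the A's that fall within 2 positions of any Y
    y_positions = [i for i, t in enumerate(tokens) if t == "Y"]
    return sum(
        1 for i, t in enumerate(tokens)
        if t == "A" and any(abs(i - y) <= 2 for y in y_positions)
    )

df = spark.createDataFrame([("X X A X Y X X X A",)], ["text"])
df.select(within2(F.split("text", " ")).alias("within2")).show()  # -> 1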
Why hand-roll UDFs for proximity-based sentiment? Let’s use more advanced natural language processing packages!
Which libraries have the most features?
Just because it is scalable does not mean it lacks features!
Why??
Aaaaannndddd…. it scales!
Reusing the Spark ML Pipeline
Unified NLP & ML pipelines
End-to-end execution planning
Serializable
Distributable
Reusing NLP Functionality
TF-IDF calculation
String distance calculation
Stop word removal
Topic modeling
Distributed ML algorithms
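The familiar Spark ML Pipeline API already covers several of these pieces; a quick sketch combining stop word removal with a TF-IDF calculation (column names are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

docs = spark.createDataFrame([("spark nlp runs on spark",), ("the pipeline api is reusable",)], ["text"])
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    HashingTF(inputCol="filtered", outputCol="tf"),
    IDF(inputCol="tf", outputCol="tfidf"),
])
model = pipeline.fit(docs)     # only the stages that need fitting (here, IDF) are fitted
model.transform(docs).select("tfidf").show(truncate=False)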
Annotator approaches require the fit() method to make an Annotator Model or Transformer; annotator models use the transform() method only; pretrained models are loaded with the pretrained() method.
Do annotators drop or overwrite your columns? No!! They only add columns.
PretrainedPipeline.transform() runs a pretrained pipeline on a DataFrame; PretrainedPipeline.annotate() runs it on a plain string.
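A short sketch with a pretrained pipeline (assumes Spark NLP is installed and started with sparknlp.start(); explain_document_ml is one of the published English pipelines):

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()
pipeline = PretrainedPipeline("explain_document_ml", lang="en")

# annotate() takes a plain string and returns a dict of annotation lists
print(pipeline.annotate("Spark NLP makes distributed text processing easy.")["token"])

# transform() takes a DataFrame and appends annotation columns
df = spark.createDataFrame([("Spark NLP makes distributed text processing easy.",)], ["text"])
pipeline.transform(df).select("token.result").show(truncate=False)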
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, Normalizer, WordEmbeddingsModel
from pyspark.ml import Pipeline
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")
tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")
normalizer = Normalizer()\
    .setInputCols(["token"])\
    .setOutputCol("normal")
word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "normal"])\
    .setOutputCol("embeddings")
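These stages can then be chained and run with the usual Pipeline mechanics; a sketch assuming a DataFrame data with a text column:

pipeline = Pipeline(stages=[document_assembler, sentenceDetector, tokenizer, normalizer, word_embeddings])
model = pipeline.fit(data)        # only the stages that need fitting are fitted
result = model.transform(data)    # each stage appends its own annotation column
result.select("normal.result").show(truncate=False)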
import sparknlp; from sparknlp.base import *
from sparknlp.annotator import *; from pyspark.ml import Pipeline
data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
result = documentAssembler.transform(data)
result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
result.select("document").printSchema()
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
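Given that schema, the useful pieces can be pulled out of the annotation structs with ordinary DataFrame operations, for example:

from pyspark.sql import functions as F
result.select(F.explode("document").alias("ann")) \
    .select("ann.result", "ann.begin", "ann.end", "ann.metadata") \
    .show(truncate=False)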
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer
from sparknlp.annotator import *
from sparknlp.base import *
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)
!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}
sequenceClassifier = DistilBertForSequenceClassification.loadSavedModel(
'{}/saved_model/1'.format(MODEL_NAME),spark)\
.setInputCols(["document",'token'])\
.setOutputCol("class").setCaseSensitive(True).setMaxSentenceLength(128)
#### This saves to ./{MODEL_NAME}_spark_nlp, relative to the driver's current working directory
sequenceClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))
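Once exported, the annotator can be loaded back and used like any other pipeline stage; a sketch assuming the export above succeeded:

from pyspark.ml import Pipeline
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
classifier = DistilBertForSequenceClassification.load("./{}_spark_nlp".format(MODEL_NAME)) \
    .setInputCols(["document", "token"]).setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, tokenizer, classifier])
reviews = spark.createDataFrame([("This movie was wonderful.",)], ["text"])
pipeline.fit(reviews).transform(reviews).select("class.result").show()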
Required:
Encouraged:
Review project overview https://gu-dsan.github.io/6000-fall-2023/project/project.html
Assignment link - https://georgetown.instructure.com/courses/172712/assignments/958317
DSAN 6000 | Fall 2023 | https://gu-dsan.github.io/6000-fall-2023/