Spark UDFs and Project Introduction
Georgetown University
Fall 2023
By default, RDDs are recomputed every time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once.
Spark allows you to control what is cached in memory.
To tell Spark to cache an object in memory, use persist() or cache():
cache(): a shortcut for the default storage level, which is memory only
persist(): can be customized to other ways of persisting data (including memory and/or disk)
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf
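For example, a minimal sketch (using a hypothetical DataFrame built with spark.range) contrasting cache() with the default storage level and persist() with an explicit StorageLevel:
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)  # hypothetical example DataFrame

# cache() uses the default storage level
evens = df.filter(df.id % 2 == 0).cache()
# persist() lets you choose the storage level explicitly
thirds = df.filter(df.id % 3 == 0).persist(StorageLevel.MEMORY_AND_DISK)

evens.count()   # the first action materializes the cache
thirds.count()

evens.unpersist()   # release the cached data when no longer needed
thirds.unpersist()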
Use collect() with CAUTION: it brings the entire dataset back to the driver.
The Spark Application UI shows important facts about your Spark job:
Adapted from AWS Glue Spark UI docs and Spark UI docs
Problem: make a new column containing the ages of adults only
+-------+--------------+
|room_id| guests_ages|
+-------+--------------+
| 1| [18, 19, 17]|
| 2| [25, 27, 5]|
| 3|[34, 38, 8, 7]|
+-------+--------------+
Adapted from UDFs in Spark
from pyspark.sql.functions import udf, col

# declare the return type with a DDL type string
@udf("array<integer>")
def filter_adults(elements):
    return list(filter(lambda x: x >= 18, elements))

# alternatively, declare the return type with Spark SQL type objects
from pyspark.sql.types import IntegerType, ArrayType

@udf(returnType=ArrayType(IntegerType()))
def filter_adults(elements):
    return list(filter(lambda x: x >= 18, elements))
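Applying the UDF then adds the new column (assuming df is the rooms DataFrame shown above), producing the result below:
df.withColumn("adults_ages", filter_adults(col("guests_ages"))).show()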
+-------+----------------+------------+
|room_id| guests_ages | adults_ages|
+-------+----------------+------------+
| 1 | [18, 19, 17] | [18, 19]|
| 2 | [25, 27, 5] | [25, 27]|
| 3 | [34, 38, 8, 7] | [34, 38]|
| 4 |[56, 49, 18, 17]|[56, 49, 18]|
+-------+----------------+------------+
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
# define the function that can be tested locally
def squared(s):
    return s * s
# wrap the function in udf for spark and define the output type
squared_udf = udf(squared, LongType())
# execute the udf
df = spark.table("test")
display(df.select("id", squared_udf("id").alias("id_squared")))
# register the Python function so it can be called from Spark SQL
spark.udf.register("squaredWithPython", squared)
-- now usable in a SQL query
select id, squaredWithPython(id) as id_squared from test
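Note: when a plain Python function is registered, the return type defaults to string; a sketch of passing the return type explicitly and running the same query from Python:
from pyspark.sql.types import LongType

# register with an explicit return type so the SQL result is numeric
spark.udf.register("squaredWithPython", squared, LongType())
spark.sql("select id, squaredWithPython(id) as id_squared from test").show()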
Costs: row-at-a-time Python UDFs serialize data between the JVM and the Python worker for every row, which adds significant overhead.
Other ways to make your Spark jobs faster (source):
From the PySpark docs - Pandas UDFs are user-defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data, which allows vectorized operations. A Pandas UDF is defined using pandas_udf as a decorator or to wrap the function, and no additional configuration is required. A Pandas UDF behaves like a regular PySpark function API in general.
import pandas as pd
from pyspark.sql.functions import pandas_udf

# the return type declares a struct with two string fields: first and last
@pandas_udf("first string, last string")
def split_expand(s: pd.Series) -> pd.DataFrame:
    return s.str.split(expand=True)

df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(split_expand("name")).show()
+------------------+
|split_expand(name)|
+------------------+
| [John, Doe]|
+------------------+
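Because the return type declares the fields first and last, the resulting struct column can also be expanded into separate columns, for example:
df.select(split_expand("name").alias("names")).select("names.*").show()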
from pyspark.sql.functions import udf

# Use udf to define a row-at-a-time udf
@udf('double')
# Input/output are both a single double value
def plus_one(v):
    return v + 1

df.withColumn('v2', plus_one(df.v))
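For comparison, a vectorized Pandas UDF version of the same operation (a sketch, assuming df has a double column v as above) operates on whole pandas Series at a time:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Input/output are whole pandas Series rather than single values
@pandas_udf('double')
def pandas_plus_one(v: pd.Series) -> pd.Series:
    return v + 1

df.withColumn('v2', pandas_plus_one(df.v))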
Pandas UDF variants differ along the following dimensions:
Input of the user-defined function
Output of the user-defined function
Grouping semantics
Output size
Deliverable | Deadline
---|---
Project EDA Milestone | 2023-11-06 11:59pm
Project NLP Milestone | 2023-11-20 11:59pm
Project Peer Feedback | 2023-11-20 11:59pm
Project ML Milestone | 2023-11-30 11:59pm
Final Project Milestone | 2023-12-08 11:59pm
EDA - project plan, initial data exploration, summary graphs and tables, some data transformation
NLP - external dataset merging, more data transformation, leverage an NLP model
Peer feedback - review another group’s project and provide constructive feedback on their EDA milestone
ML - build several ML models, compare performance, answer some interesting questions
Final delivery - complete output that incorporates the feedback given by instructors and peers, with improved analysis and work from the intermediate deliverables
DSAN 6000 | Fall 2023 | https://gu-dsan.github.io/6000-fall-2023/