Spark UDFs and Project Introduction
Georgetown University
Fall 2023
By default, RDDs are recomputed every time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once.
Spark allows you to control what is cached in memory.
To tell Spark to cache an object in memory, use persist() or cache():
cache(): a shortcut for the default storage level, which is memory only
persist(): can be customized to other ways of persisting data (including memory and/or disk)
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf
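For example, a minimal sketch (using a hypothetical DataFrame built with spark.range) contrasting cache() with the default storage level and persist() with an explicit StorageLevel:
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)  # hypothetical example DataFrame

# cache() uses the default storage level
evens = df.filter(df.id % 2 == 0).cache()
# persist() lets you choose the storage level explicitly
thirds = df.filter(df.id % 3 == 0).persist(StorageLevel.MEMORY_AND_DISK)

evens.count()   # the first action materializes the cache
thirds.count()

evens.unpersist()   # release the cached data when no longer needed
thirds.unpersist()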
Use collect() with CAUTION: it brings the entire dataset back to the driver.
The Spark Application UI shows important facts about your Spark job:
Adapted from AWS Glue Spark UI docs and Spark UI docs
Problem: make a new column containing the ages of adults only
+-------+--------------+
|room_id| guests_ages|
+-------+--------------+
| 1| [18, 19, 17]|
| 2| [25, 27, 5]|
| 3|[34, 38, 8, 7]|
+-------+--------------+
Adapted from UDFs in Spark
from pyspark.sql.functions import udf, col

# declare the return type with a DDL type string
@udf("array<integer>")
def filter_adults(elements):
    return list(filter(lambda x: x >= 18, elements))

# alternatively, declare the return type with Spark SQL type objects
from pyspark.sql.types import IntegerType, ArrayType

@udf(returnType=ArrayType(IntegerType()))
def filter_adults(elements):
    return list(filter(lambda x: x >= 18, elements))
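Applying the UDF then adds the new column (assuming df is the rooms DataFrame shown above), producing the result below:
df.withColumn("adults_ages", filter_adults(col("guests_ages"))).show()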
+-------+----------------+------------+
|room_id| guests_ages | adults_ages|
+-------+----------------+------------+
| 1 | [18, 19, 17] | [18, 19]|
| 2 | [25, 27, 5] | [25, 27]|
| 3 | [34, 38, 8, 7] | [34, 38]|
| 4 |[56, 49, 18, 17]|[56, 49, 18]|
+-------+----------------+------------+
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
# define the function that can be tested locally
def squared(s):
    return s * s
# wrap the function in udf for spark and define the output type
squared_udf = udf(squared, LongType())
# execute the udf
df = spark.table("test")
display(df.select("id", squared_udf("id").alias("id_squared")))
# register the Python function so it can be called from Spark SQL
spark.udf.register("squaredWithPython", squared)
-- now usable in a SQL query
select id, squaredWithPython(id) as id_squared from test
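Note: when a plain Python function is registered, the return type defaults to string; a sketch of passing the return type explicitly and running the same query from Python:
from pyspark.sql.types import LongType

# register with an explicit return type so the SQL result is numeric
spark.udf.register("squaredWithPython", squared, LongType())
spark.sql("select id, squaredWithPython(id) as id_squared from test").show()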
Costs: row-at-a-time Python UDFs serialize data between the JVM and the Python worker for every row, which adds significant overhead.
Other ways to make your Spark jobs faster (source):
From the PySpark docs - Pandas UDFs are user-defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data, which allows vectorized operations. A Pandas UDF is defined using pandas_udf as a decorator or to wrap the function, and no additional configuration is required. A Pandas UDF behaves like a regular PySpark function API in general.
import pandas as pd
from pyspark.sql.functions import pandas_udf

# the return type declares a struct with two string fields: first and last
@pandas_udf("first string, last string")
def split_expand(s: pd.Series) -> pd.DataFrame:
    return s.str.split(expand=True)

df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(split_expand("name")).show()
+------------------+
|split_expand(name)|
+------------------+
| [John, Doe]|
+------------------+
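Because the return type declares the fields first and last, the resulting struct column can also be expanded into separate columns, for example:
df.select(split_expand("name").alias("names")).select("names.*").show()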
from pyspark.sql.functions import udf

# Use udf to define a row-at-a-time udf
@udf('double')
# Input/output are both a single double value
def plus_one(v):
    return v + 1

df.withColumn('v2', plus_one(df.v))
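For comparison, a vectorized Pandas UDF version of the same operation (a sketch, assuming df has a double column v as above) operates on whole pandas Series at a time:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Input/output are whole pandas Series rather than single values
@pandas_udf('double')
def pandas_plus_one(v: pd.Series) -> pd.Series:
    return v + 1

df.withColumn('v2', pandas_plus_one(df.v))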
Pandas UDF variants differ along the following dimensions:
Input of the user-defined function
Output of the user-defined function
Grouping semantics
Output size
Deliverable | Deadline
---|---
Project EDA Milestone | 2023-11-06 11:59pm
Project NLP Milestone | 2023-11-20 11:59pm
Project Peer Feedback | 2023-11-20 11:59pm
Project ML Milestone | 2023-11-30 11:59pm
Final Project Milestone | 2023-12-08 11:59pm
EDA - project plan, initial data exploration, summary graphs and tables, some data transformation
NLP - external dataset merging, more data transformation, leverage an NLP model
Peer feedback - review another group’s project and provide constructive feedback on their EDA milestone
ML - build several ML models, compare performance, answer some interesting questions
Final delivery - complete output that incorporates the feedback given by instructors and peers, with improved analysis and work from the intermediate deliverables
DSAN 6000 | Fall 2023 | https://gu-dsan.github.io/6000-fall-2023/