Lecture 5

File types, file systems, and working with large tabular data on a single node

Amit Arora, Abhijit Dasgupta, Anderson Monken, and Marck Vaisman

Georgetown University

Fall 2023

Agenda and Goals for Today

Lecture

Lab

Logistics and Review

Deadlines

  • Assignment 4: Containers Due Sept 25 11:59pm
  • Lab 5: DuckDB & Polars Due Sept 26 6pm

Look back and ahead

Looking back at distributing computations

Starting with our BIG dataset

The data is split

The data is distributed across a cluster of machines

You can think of your split/distributed data as a single collection

Important Latency Numbers

Memory vs. Disk

Memory vs. Network

Memory, Disk and Network

MapReduce/Hadoop was groundbreaking

  • It provided a simple API (map and reduce steps)

  • It provided fault tolerance, which made it possible to scale to 100s/1000s of nodes of commodity machines where the likelihood of a node failing midway through a job was very high

    • Computations on very large datasets could fail partway, recover, and still run to completion

Fault tolerance came at a cost!

  • Between each map and reduce step, MapReduce shuffles its data and writes intermediate data to disk
    • Reading/writing to disk is 100x slower than in-memory
    • Network communication is 1,000,000x slower than in-memory

Start your Spark Cluster on AWS

Introducing Spark: a Unified Engine

What is Spark

  • A simple programming model that can capture streaming, batch, and interactive workloads

  • Retains fault-tolerance

  • Uses a different strategy for handling latency: it keeps all data immutable and in memory

  • Provides speed and flexibility

Spark Stack

Connected and extensible

Three data structure APIs

  1. RDDs (Resilient Distributed Datasets)

  2. DataFrames: SQL-like structured datasets with query operations

  3. Datasets: a mixture of RDDs and DataFrames

We’ll only use RDDs and DataFrames in this course.
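As a quick orientation, here is a minimal sketch (assuming a running SparkSession bound to the name spark, as on the course cluster) that builds an RDD and a DataFrame from the same small in-memory collection; the data and column names are made up for illustration.

# assumes `spark` is an existing SparkSession
sc = spark.sparkContext

people = [("Alice", 34), ("Bob", 29), ("Cathy", 41)]

people_rdd = sc.parallelize(people)                         # RDD: distributed collection of Python tuples
people_df = spark.createDataFrame(people, ["name", "age"])  # DataFrame: same data, named columns

people_rdd.take(2)   # [('Alice', 34), ('Bob', 29)]
people_df.show()     # pretty-prints the rows with column headers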

Spark Architecture and job flow

Spark vs. Hadoop

Hadoop limitation: For iterative processes and interactive use, Hadoop/MapReduce's mandatory dumping of output to disk proved to be a huge bottleneck. In ML, for example, users rely on iterative train-test-retrain loops.
Spark approach: Spark uses an in-memory processing paradigm, which lowers disk IO substantially. Spark uses DAGs to record each transformation applied to a parallelized dataset and does not process them until a result is required (lazy evaluation).

Hadoop limitation: Traditional Hadoop applications first needed the data to be copied to HDFS (or another distributed filesystem) and only then did the processing.
Spark approach: Spark works equally well with HDFS or any POSIX-style filesystem. However, parallel Spark still needs the data to be distributed.

Hadoop limitation: Mappers needed a data-localization phase in which the data was written to the local filesystem to provide resilience.
Spark approach: Resilience in Spark is brought about by the DAGs: a missing RDD is re-computed by following the lineage from which it was created.

Hadoop limitation: Hadoop is built on Java, and you must use Java to take advantage of all of its capabilities. Although you can run non-Java scripts with Hadoop Streaming, they still run on a Java framework.
Spark approach: Spark is developed in Scala and exposes a unified API, so you can use Spark from Scala, Java, R, and Python.

Introducing the RDD

Example: word count (yes, again!)

The “Hello, World!” of programming with large scale data.

# read data from a text file; each element of the RDD is one line
rdd = sc.textFile("...")

counts = (rdd.flatMap(lambda line: line.split(" "))   # separate lines into words
             .map(lambda word: (word, 1))             # pair each word with a 1
             .reduceByKey(lambda a, b: a + b))        # sum all the 1's for each word

That’s it!

Transformations and Actions (key Spark concept)

How to create an RDD?

RDDs can be created in two ways:

  • Transforming an existing RDD: just like a call to map on a list returns a new list, many higher order functions defined on RDDs return a new RDD

  • From a SparkContext or SparkSession object: the SparkContext object (wrapped by the newer SparkSession entry point) can be thought of as your handle to the Spark cluster. It represents a connection between the Spark cluster and your application/client. It defines a handful of methods which can be used to create and populate a new RDD (see the sketch after this list):

    • parallelize: converts a local object into an RDD
    • textFile: reads a text file from your filesystem and returns an RDD of strings
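A minimal sketch of both creation routes, assuming sc is an existing SparkContext; the file path is hypothetical.

# assumes `sc` is an existing SparkContext
nums = sc.parallelize([1, 2, 3, 4, 5])     # local Python list -> RDD
lines = sc.textFile("data/sample.txt")     # hypothetical path; one string per line

nums.count()     # 5
lines.first()    # the first line of the file, as a string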

Transformations and Actions

Spark defines transformations and actions on RDDs:

Transformations return new RDDs as results.
Transformations are lazy: their result RDD is not immediately computed.

Actions compute a result based on an RDD, which is either returned or saved to an external filesystem.
Actions are eager: their result is immediately computed.
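A small sketch of the difference, assuming an existing SparkContext sc: the map below only builds up a plan, and nothing is computed until the action on the last line.

# assumes `sc` is an existing SparkContext
nums = sc.parallelize(range(1_000_000))

squares = nums.map(lambda x: x * x)   # transformation: lazy, returns a new RDD immediately

squares.count()                       # action: eager, triggers the actual computation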

Common RDD Transformations

  • map: expresses a one-to-one transformation; transforms each element of a collection into one element of the resulting collection
  • flatMap: expresses a one-to-many transformation; transforms each element into 0 or more elements
  • filter: applies a filter function that returns a boolean and returns an RDD of the elements that pass the filter condition
  • distinct: returns an RDD with duplicates removed
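A minimal sketch exercising these four transformations, assuming an existing SparkContext sc. Remember that these are all lazy; nothing runs until an action is called.

# assumes `sc` is an existing SparkContext
lines = sc.parallelize(["to be or not to be", "to be is to do"])

words = lines.flatMap(lambda line: line.split(" "))   # one-to-many: lines -> words
lengths = words.map(lambda w: len(w))                 # one-to-one: word -> word length
longish = words.filter(lambda w: len(w) > 2)          # keep only words longer than 2 characters
unique = words.distinct()                             # drop duplicate words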

Common RDD Actions

  • collect: returns all distributed elements of the RDD to the driver
  • count: returns the number of elements in the RDD
  • take: returns the first n elements of the RDD
  • reduce: combines the elements of the RDD using some function and returns the result
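And a sketch of these actions on a deliberately tiny RDD (small enough that collect is safe), again assuming an existing SparkContext sc.

# assumes `sc` is an existing SparkContext
nums = sc.parallelize([3, 1, 4, 1, 5, 9])

nums.collect()                     # [3, 1, 4, 1, 5, 9] -- brings every element to the driver
nums.count()                       # 6
nums.take(3)                       # [3, 1, 4]
nums.reduce(lambda a, b: a + b)    # 23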

collect CAUTION

Another example

Let’s assume that we have an RDD of strings which contains gigabytes of logs collected over the previous year. Each element of the RDD represents one line of logs.

Assuming the dates come in the form YYYY-MM-DD:HH:MM:SS and errors are logged with a prefix that includes the word “error”…

How would you determine the number of errors that were logged in December 2019?

# read the log data from a text file; each element of the RDD is one line of logs
logs = sc.textFile("...") 

# this is a transformation
errors = logs.filter(lambda x: "error" in x and "2019-12" in x)

# this is an action
errors.count()

Spark computes RDDs the first time they are used in an action!

Caching and Persistence

By default, RDDs are recomputed every time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once.

Spark allows you to control what is cached in memory.

To tell Spark to cache an object in memory, use persist() or cache():

  • cache(): a shortcut for persisting with the default storage level, which is memory only
  • persist(): can be customized to persist data in other ways (including memory and/or disk)
# caches error RDD in memory, but only after an action is run
errors = logs.filter(lambda x: "error" in x and "2019-12" in x).cache()
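For the customized route, here is a hedged sketch of persist() with an explicit storage level; it reuses the logs RDD from the example above and spills partitions that do not fit in memory to local disk.

from pyspark import StorageLevel

# keep partitions in memory when possible, spill the rest to disk
errors = logs.filter(lambda x: "error" in x and "2019-12" in x) \
             .persist(StorageLevel.MEMORY_AND_DISK)

errors.count()   # the first action materializes and persists the RDD
errors.take(5)   # later actions reuse the persisted data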

Using memory is great for iterative workloads

DataFrames

DataFrames in a nutshell

DataFrames are…

Datasets organized into named columns

Conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

A relational API over Spark’s RDDs

Because sometimes it’s more convenient to use declarative relational APIs than functional APIs for analysis jobs:

  • select
  • where
  • limit
  • orderBy
  • groupBy
  • join

Able to be automatically and aggressively optimized

SparkSQL applies years of research on relational optimizations in the database community to Spark

DataFrame Data Types

SparkSQL’s DataFrames operate on a restricted (yet broad) set of data types. These are the most common:

  • Integer types (at different lengths): ByteType, ShortType, IntegerType, LongType
  • Floating-point types: FloatType, DoubleType
  • BooleanType
  • StringType
  • Date/Time: TimestampType, DateType
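As an illustration, here is a minimal sketch of an explicit schema built from several of these types; the column names and file path are made up, and spark is assumed to be an existing SparkSession.

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DoubleType, BooleanType, DateType)

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType()),
    StructField("salary", DoubleType()),
    StructField("is_active", BooleanType()),
    StructField("hired_on", DateType()),
])

df = spark.read.csv("data/employees.csv", header=True, schema=schema)  # hypothetical path
df.printSchema()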

A DataFrame

Getting a look at your data

There are a few ways you can have a look at your data in DataFrames:

  • show() pretty-prints the DataFrame in tabular form; shows the first 20 rows by default.

  • printSchema() prints the schema of your DataFrame in a tree format.

Common DataFrame Transformations

Like on RDDs, transformations on DataFrames are:

  1. Operations that return another DataFrame as a result
  2. Lazily evaluated

Some common transformations include:

  • select: selects a set of named columns and returns a new DataFrame with those columns as the result
  • agg: performs aggregations on a series of columns and returns a new DataFrame with the calculated output
  • groupBy: groups the DataFrame using the specified columns, usually used before some kind of aggregation
  • join: inner join with another DataFrame

Other transformations include: filter, limit, orderBy, where.
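A minimal sketch chaining several of these transformations; employees_df and departments_df are hypothetical DataFrames with the columns used below.

from pyspark.sql import functions as F

# hypothetical inputs: employees_df(name, dept_id, salary), departments_df(dept_id, dept_name)
result = (employees_df
          .join(departments_df, on="dept_id")          # inner join on the shared key
          .groupBy("dept_name")                        # group rows by department
          .agg(F.avg("salary").alias("avg_salary"))    # aggregate within each group
          .select("dept_name", "avg_salary")           # keep only the named columns
          .orderBy(F.desc("avg_salary"))               # sort (another transformation)
          .limit(10))                                  # cap the result at 10 rows

result.show()   # action: everything above stays lazy until this line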

Specifying columns

Most methods take a parameter of type Column or String, always referring to some attribute/column in the DataFrame.

You can select and work with columns in several ways using the DataFrame API (see the sketch after this list):

  1. Using the col() function: df.filter(col("age") > 18)

  2. Referring to the DataFrame: df.filter(df["age"] > 18)

  3. Using a SQL expression string: df.filter("age > 18")
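A runnable version of the three styles, assuming a hypothetical df with an age column; note that col() must be imported from pyspark.sql.functions.

from pyspark.sql.functions import col

df.filter(col("age") > 18)    # 1. column expression via the col() function
df.filter(df["age"] > 18)     # 2. referring to the column through the DataFrame
df.filter("age > 18")         # 3. SQL expression string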

Filtering in SparkSQL

The DataFrame API makes two methods available for filtering: filter and where. They are equivalent!

employee_df.filter("age > 30").show()

is equivalent to

employee_df.where("age > 30").show()

Use either the DataFrame API or SparkSQL

The DataFrame API and SparkSQL syntax can be used interchangeably!

Example: return the firstname and lastname of all employees aged 25 or older who reside in Washington D.C.

DataFrame API

results = df.where("city == 'Washington D.C.' and age >= 25") \
            .select("firstname", "lastname")

SparkSQL

spark.sql("select firstname, lastname from df_view where city == 'Washington D.C.' and age >= 25")
          
# Note: you must first register `df` as a view using `df.createOrReplaceTempView("df_view")`

Grouping and aggregating on DataFrames

Some of the most common tasks on structured data tables include:

  1. Grouping by a certain attribute
  2. Doing some kind of aggregation on the grouping, like a count

For grouping and aggregating, SparkSQL provides a groupBy function which returns a grouped dataset (RelationalGroupedDataset in Scala, GroupedData in PySpark) with several standard aggregation functions like count, sum, max, min, and avg.

How to group

  • Call a groupBy on a specific attribute/column of a DataFrame
  • followed by a call to agg
from pyspark.sql import functions as F  # use Spark's sum, not Python's built-in

results = df.groupBy("state") \
            .agg(F.sum("sales"))
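The same pattern extends to several aggregations at once; a sketch assuming a hypothetical df with state and sales columns, reusing the F alias imported above.

summary = df.groupBy("state").agg(
    F.sum("sales").alias("total_sales"),
    F.avg("sales").alias("avg_sales"),
    F.count("sales").alias("n_rows"),
)
summary.show()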

Actions on DataFrames

Like RDDs, DataFrames also have their own set of actions:

  • collect: returns an array containing all the rows of the DataFrame to the driver
  • count: returns the number of rows in the DataFrame
  • first: returns the first row in the DataFrame
  • show: displays the top 20 rows of the DataFrame
  • take: returns the first n rows of the DataFrame
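A minimal sketch of these actions, assuming a hypothetical (and small) df.

df.count()      # number of rows
df.first()      # the first Row object
df.show(5)      # pretty-print the first 5 rows
df.take(3)      # list of the first 3 Row objects
df.collect()    # every row to the driver -- only safe for small DataFrames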

collect CAUTION

Limitations of DataFrames

  • Can only use DataFrame data types
  • If your unstructured data cannot be reformulated to adhere to some kind of schema, it would be better to use RDDs.

One of the most common performance bottlenecks for new Spark users comes from re-evaluating the same chain of transformations when the intermediate result could have been cached in memory!

Lab