Looking back at distributing computations

Starting with our BIG dataset

The data is split

The data is distributed across a cluster of machines

You can think of your split/distributed data as a single collection

Important Latency Numbers

Memory vs. Disk

Memory vs. Network

Memory, Disk and Network

MapReduce/Hadoop was groundbreaking

  • It provided a simple API (map and reduce steps)

  • It provided fault tolerance, which made it possible to scale to hundreds or thousands of commodity machines, where the likelihood of a node failing midway through a job is very high

    • Computations on very large datasets could fail partway, recover, and still run to completion

Fault tolerance came at a cost!

  • Between each map and reduce step, MapReduce shuffles its data and writes intermediate data to disk
    • Reading/writing to disk is roughly 100x slower than working in memory
    • Network communication can be roughly 1,000,000x slower than working in memory
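To get a feel for these ratios, it can help to rescale them to human time. The short sketch below assumes one in-memory operation takes one second and simply applies the two ratios quoted above; the variable names are illustrative only.

```python
MEMORY_S = 1                 # pretend one in-memory operation takes 1 second
DISK_FACTOR = 100            # disk vs. memory ratio quoted above
NETWORK_FACTOR = 1_000_000   # network vs. memory ratio quoted above

disk_seconds = MEMORY_S * DISK_FACTOR
network_days = MEMORY_S * NETWORK_FACTOR / 86_400   # 86,400 seconds per day

print(f"disk: {disk_seconds} s, network: {network_days:.1f} days")
```

On this scale a disk access takes well over a minute and a network round trip takes more than a week, which is why avoiding both matters so much.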

Start your Spark Cluster on AWS

Introducing Spark: a Unified Engine

What is Spark

  • A simple programming model that can capture streaming, batch, and interactive workloads

  • Retains fault-tolerance

  • Uses a different strategy for handling latency: it keeps all data immutable and in memory

  • Provides speed and flexibility

Spark Stack

Connected and extensible

Three data structure APIs

  1. RDDs (Resilient Distributed Datasets)

  2. DataFrames: SQL-like structured datasets with query operations

  3. Datasets: a mixture of RDDs and DataFrames

We’ll only use RDDs and DataFrames in this course.

Spark Architecture and job flow

Spark vs. Hadoop

Hadoop limitation vs. Spark approach

  • Hadoop limitation: For iterative processes and interactive use, Hadoop and MapReduce's mandatory dumping of output to disk proved to be a huge bottleneck. In ML, for example, users rely on iterative processes to train, test, and retrain.
    • Spark approach: Spark uses an in-memory processing paradigm, which lowers disk I/O substantially. Spark uses DAGs to record each transformation applied to a parallelized dataset and does not compute results until they are required (lazy evaluation).

  • Hadoop limitation: Traditional Hadoop applications needed the data to be copied to HDFS (or another distributed filesystem) before any processing could start.
    • Spark approach: Spark works equally well with HDFS or any POSIX-style filesystem. However, parallel Spark still needs the data to be distributed.

  • Hadoop limitation: Mappers needed a data-localization phase in which the data was written to the local filesystem to provide resilience.
    • Spark approach: Resilience in Spark is brought about by the DAGs: a missing RDD is recalculated by following the path from which it was created.

  • Hadoop limitation: Hadoop is built on Java, and you must use Java to take advantage of all of its capabilities. Although you can run non-Java scripts with Hadoop Streaming, it is still running a Java framework.
    • Spark approach: Spark is developed in Scala and has a unified API, so you can use Spark with Scala, Java, R, and Python.

Introducing the RDD

Example: word count (yes, again!)

The “Hello, World!” of programming with large scale data.

# read data from a text file; each RDD element is one line
rdd = sc.textFile("...")

# parentheses allow line breaks with trailing comments
# (a backslash continuation cannot be followed by a comment)
count = (
    rdd.flatMap(lambda line: line.split(" "))  # separate lines into words
       .map(lambda word: (word, 1))            # pair each word with a count of 1
       .reduceByKey(lambda a, b: a + b)        # sum all the 1's for each key
)

That’s it!
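To make the three steps concrete, here is a minimal local sketch in plain Python (no Spark involved) that mimics what flatMap, map, and reduceByKey each produce. The sample lines and the helper name `local_word_count` are made up for illustration. On a cluster, Spark would apply the same logic in parallel across partitions.

```python
from collections import defaultdict

def local_word_count(lines):
    """Mimic flatMap -> map -> reduceByKey on a plain Python list."""
    # flatMap: each line yields zero or more words
    words = [word for line in lines for word in line.split(" ")]
    # map: pair every word with a count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: combine the values of equal keys with a + b
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

sample_lines = ["to be or not", "to be"]  # hypothetical input
result = local_word_count(sample_lines)
print(result)
```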

Transformations and Actions (key Spark concept)

How to create an RDD?

RDDs can be created in two ways:

  • Transforming an existing RDD: just like a call to map on a list returns a new list, many higher order functions defined on RDDs return a new RDD

  • From a SparkContext or SparkSession object: the SparkContext object (wrapped by SparkSession since Spark 2.0 and available as spark.sparkContext) can be thought of as your handle to the Spark cluster. It represents a connection between the Spark cluster and your application/client. It defines a handful of methods which can be used to create and populate a new RDD:

    • parallelize: converts a local object into an RDD
    • textFile: reads a text file from your filesystem and returns an RDD of strings
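Both routes can be sketched in a few lines of PySpark. The function below is only an illustration: it assumes `spark` is an already-running SparkSession and that `path` points at a readable text file, and the name `build_example_rdds` is hypothetical.

```python
def build_example_rdds(spark, path):
    """Create RDDs from a local list, from a text file, and from another RDD.

    `spark` is assumed to be an existing SparkSession; `path` is a placeholder.
    """
    sc = spark.sparkContext                     # the SparkContext behind the session
    numbers = sc.parallelize([1, 2, 3, 4, 5])   # local list -> RDD of ints
    lines = sc.textFile(path)                   # text file -> RDD of strings, one per line
    squares = numbers.map(lambda x: x * x)      # transforming an RDD returns a new RDD
    return numbers, lines, squares
```

None of these calls does any work on the cluster yet. The data is only read or computed once an action is run on one of the returned RDDs.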

Transformations and Actions

Spark defines transformations and actions on RDDs:

Transformations return new RDDs as results.

Actions compute a result based on an RDD which is either returned or saved to an external filesystem.
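The difference between the two is easiest to see in a toy model. The class below is not Spark and not how Spark is implemented. It is a small plain-Python stand-in (the name `ToyRDD` is made up) in which transformations only record what should happen and actions actually run the recorded steps.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source   # the original local data
        self._steps = steps     # recorded (kind, function) pairs

    # transformations: return a new ToyRDD and do no work yet
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._steps + (("filter", fn),))

    def _run(self):
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # actions: trigger the computation and return a plain Python value
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []

def noisy_double(x):
    calls.append(x)   # record that the function really ran
    return 2 * x

doubled = ToyRDD([1, 2, 3]).map(noisy_double).filter(lambda x: x > 2)
calls_before_action = len(calls)   # nothing has been computed yet
values = doubled.collect()         # the action runs the recorded steps
n = doubled.count()                # a second action runs them again
```

Note that the second action re-runs `noisy_double` on every element, because nothing was kept from the first run.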

Common RDD Transformations

  • map: expresses a one-to-one transformation; each element of the collection becomes exactly one element of the resulting collection
  • flatMap: expresses a one-to-many transformation; each element becomes 0 or more elements
  • filter: applies a function that returns a boolean and returns an RDD of the elements that pass the filter condition
  • distinct: returns an RDD with duplicates removed

Common RDD Actions

  • collect: returns all distributed elements of the RDD to the driver
  • count: returns the number of elements in an RDD
  • take: returns the first n elements of the RDD
  • reduce: combines the elements of the RDD together using some function and returns the result

collect CAUTION

Another example

Let’s assume that we have an RDD of strings which contains gigabytes of logs collected over the previous year. Each element of the RDD represents one line of logs.

Assuming the dates come in the form YYYY-MM-DD:HH:MM:SS and errors are logged with a prefix that includes the word “error”…

How would you determine the number of errors that were logged in December 2019?

# read the logs from a text file; each RDD element is one line
logs = sc.textFile("...")

# this is a transformation
errors = logs.filter(lambda x: "error" in x and "2019-12" in x)

# this is an action
errors.count()
Spark computes RDDs the first time they are used in an action!
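The same logic can be packaged so that the matching rule is an ordinary Python function that can be checked locally before it is sent to the cluster. This is a sketch under assumptions: `sc` is taken to be a live SparkContext, and the names `is_error_in_month` and `count_errors` are hypothetical.

```python
def is_error_in_month(line, month="2019-12"):
    """True if a log line contains "error" and the given YYYY-MM string."""
    return "error" in line and month in line

def count_errors(sc, path, month="2019-12"):
    """Count matching log lines; `sc` is assumed to be a live SparkContext."""
    logs = sc.textFile(path)   # lazy: nothing is read yet
    # transformation (lazy): only describes the filtering
    errors = logs.filter(lambda line: is_error_in_month(line, month))
    return errors.count()      # action: triggers the actual job
```

A substring test like this also matches lines that merely mention "2019-12" somewhere in the message. If the timestamp is known to start each line, `line.startswith(month)` would be a stricter check.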

Caching and Persistence

By default, RDDs are recomputed every time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once.

Spark allows you to control what is cached in memory.

To tell Spark to cache an object in memory, use persist() or cache():

  • cache(): a shortcut for persist() with the default storage level, which is memory only
  • persist(): accepts a storage level, so data can be kept in memory, on disk, or both

# marks the errors RDD for caching; it is materialized in memory only once an action runs
errors = logs.filter(lambda x: "error" in x and "2019-12" in x).cache()
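When the filtered data may not fit in memory, persist() with an explicit storage level is an alternative. The sketch below assumes `logs` is an RDD of log lines. The function name `summarize_errors` and the choice of `MEMORY_AND_DISK` are illustrative, not a recommendation from the source.

```python
def summarize_errors(logs):
    """Run two actions on one filtered RDD, persisting it in between.

    `logs` is assumed to be an RDD of log lines (strings).
    """
    from pyspark import StorageLevel  # imported here so the sketch loads without Spark

    errors = logs.filter(lambda x: "error" in x and "2019-12" in x)
    # keep partitions in memory and spill to disk if they do not fit
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    total = errors.count()    # first action: computes and stores the RDD
    sample = errors.take(5)   # second action: reuses the stored partitions
    errors.unpersist()        # release the stored data when finished
    return total, sample
```

Without the persist() call, the second action would read and filter the log file all over again.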

Using memory is great for iterative workloads


DataFrames in a nutshell

DataFrames are…

Datasets organized into named columns

Conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

A relational API over Spark’s RDDs

Because sometimes it’s more convenient to use declarative relational APIs than functional APIs for analysis jobs:

  • select
  • where
  • limit
  • orderBy
  • groupBy
  • join

Can be automatically and aggressively optimized

SparkSQL applies years of research on relational optimizations in the database community to Spark

DataFrame Data Types

SparkSQL’s DataFrames operate on a restricted (yet broad) set of data types. These are the most common:

  • Integer types (at different lengths): ByteType, ShortType, IntegerType, LongType
  • Floating-point types: FloatType, DoubleType
  • BooleanType
  • StringType
  • Date/Time: TimestampType, DateType

A DataFrame

Getting a look at your data

There are a few ways you can have a look at your data in DataFrames:

  • show() pretty-prints the DataFrame in tabular form. It shows the first 20 rows by default.

  • printSchema() prints the schema of your DataFrame in a tree format.

Common DataFrame Transformations

Like on RDDs, transformations on DataFrames:

  1. Return another DataFrame as a result
  2. Are lazily evaluated

Some common transformations include:

  • select: selects a set of named columns and returns a new DataFrame with those columns as a result
  • agg: performs aggregations on a series of columns and returns a new DataFrame with the calculated output
  • groupBy: groups the DataFrame using the specified columns; usually used before some kind of aggregation
  • join: inner join with another DataFrame

Other transformations include: filter, limit, orderBy, where.

Specifying columns

Most methods take a parameter of type Column or String, always referring to some attribute/column in the DataFrame.

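You can select and work with columns in three ways using the DataFrame API:

  1. Using $ notation (Scala only): df.filter($"age" > 18). The PySpark equivalent uses col from pyspark.sql.functions: df.filter(col("age") > 18)

  2. Referring to the DataFrame: df.filter(df("age") > 18) in Scala, or df.filter(df["age"] > 18) in PySpark

  3. Using SQL query string: df.filter("age > 18")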

Filtering in SparkSQL

The DataFrame API makes two methods available for filtering: filter and where. They are equivalent!

employee_df.filter("age > 30").show()

is equivalent to

employee_df.where("age > 30").show()

Use either the DataFrame API or SparkSQL

The DataFrame API and SparkSQL syntax can be used interchangeably!

Example: return the firstname and lastname of all the employees aged 25 or older who reside in Washington D.C.

DataFrame API

# filter first, while city and age are still available, then keep the two name columns
results = df.where("city == 'Washington D.C.' and age >= 25") \
            .select("firstname", "lastname")

SparkSQL

# register `df` as a temporary view before querying it with SQL
df.createOrReplaceTempView("df_view")
results = spark.sql("select firstname, lastname from df_view where city == 'Washington D.C.' and age >= 25")

Grouping and aggregating on DataFrames

Some of the most common tasks on structured data tables include:

  1. Grouping by a certain attribute
  2. Doing some kind of aggregation on the grouping, like a count

For grouping and aggregating, SparkSQL provides a groupBy function. It returns a RelationalGroupedDataset (GroupedData in PySpark), which offers several standard aggregation functions such as count, sum, max, min, and avg.

How to group

  • Call groupBy on a specific attribute/column of a DataFrame
  • Follow it with a call to agg

results = df.groupBy("state") \
            .agg(...)  # pass one or more aggregate expressions here
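As one possible shape for the full groupBy/agg chain, here is a minimal PySpark sketch. It assumes `df` has `state` and `age` columns. The aggregations chosen, the output column names, and the function name `state_summary` are illustrative only.

```python
def state_summary(df):
    """Group by state, then aggregate.

    `df` is assumed to be a DataFrame with `state` and `age` columns.
    """
    from pyspark.sql import functions as F  # imported here so the sketch loads without Spark

    return (
        df.groupBy("state")                    # returns a grouped object, not a DataFrame
          .agg(
              F.count("*").alias("n_rows"),    # number of rows per state
              F.avg("age").alias("avg_age"),   # mean age per state
          )
    )
```

The result is again a DataFrame with one row per state, and like any transformation it is only computed when an action such as show() is called on it.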

Actions on DataFrames

Like RDDs, DataFrames also have their own set of actions:

  • collect: returns an array that contains all the rows in the DataFrame to the driver
  • count: returns the number of rows in a DataFrame
  • first: returns the first row in the DataFrame
  • show: displays the top 20 rows in the DataFrame
  • take: returns the first n rows of the DataFrame

collect CAUTION

Limitations on DataFrame

  • Can only use DataFrame data types
  • If your unstructured data cannot be reformulated to adhere to some kind of schema, it would be better to use RDDs.

One of the most common performance bottlenecks for new Spark users is re-evaluating the same transformations several times when the intermediate results could be cached in memory!
