Lecture 10

Visualizing Big Data

Abhijit Dasgupta, Jeff Jacobs, Anderson Monken, and Marck Vaisman

Georgetown University

Spring 2024

Where we’ve been

  • Designing for an audience
  • Visual encoding and graphical integrity
  • Building the right visualization
  • Finalizing conceptual and design considerations
  • Making readable graphics
  • HTML, CSS, JS and d3.js
  • Plotly, Altair, Vega/Vega-Lite, and Bokeh
  • Geospatial visualizations
  • Network visualizations

Today

Visualizing Big Data

  • Millions or even billions of observations
  • Looking for patterns, often multivariate
    • time series
    • geospatial orientation
    • closeness on networks
  • Not usually looking at univariate distributions or characteristics

Challenges of visualizing Big Data

Challenges

  • You’re dealing with millions, or even billions, of observations
  • You’re looking for patterns in a HUGE cloud of data

Primary issues

  1. Computational speed: plotting large datasets results in very slow rendering and will likely crash or freeze your plotting library
  2. Image quality: Rendering, interactivity, overplotting, undersaturation, etc.

Challenges

The primary challenge is computation

  • You have to map every point to a location on your canvas
  • Depending on the visual encoding you want, this may be difficult
    • Compute inter-relationships and then figure out a way to place every point in a part of a network
    • Map each point to a latitude-longitude and then incorporate particular projections to make sure the map is proportionate and accurate
    • Even a scatter plot will require mapping millions of points to a coordinate system
  • On top of this coordinate-system issue, you may well have more visual encodings to map, including the shape and color of the marks, or computed summary lines (like loess or regression lines)

Challenges

A simple experiment: Draw a scatter plot

Increase N from 10 to 100,000,000 (100 Million)

This is about the simplest challenge you could pose.

With more complexity (like maps or networks), things fall apart much faster
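A minimal sketch of that experiment in matplotlib (the file name and the cap at 10 million points are my choices; 100 million points may exhaust memory on a typical laptop):

import time
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
for n in [10, 1_000, 100_000, 1_000_000, 10_000_000]:
    x, y = rng.normal(size=n), rng.normal(size=n)
    t0 = time.process_time()
    fig, ax = plt.subplots()
    ax.scatter(x, y, s=1)
    fig.savefig("scatter_test.png")  # force a full render
    plt.close(fig)
    print(f"n = {n:>12,}: {time.process_time() - t0:.2f} s")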

Another simple experiment

Timing summary: plotting sine curves with increasing N (\(2^k\), k = 1, …, 26)

Code
import matplotlib.pyplot as plt
import numpy as np
import time
N_i = []; T_i = []
for k in range(1, 27):  # N = 2^k, k = 1, ..., 26
  N = 2**k
  # Data for plotting
  t = np.linspace(0.0, 2*np.pi, N)
  s = np.sin(2 * np.pi * t)
  # START TIME
  t0 = time.process_time()
  fig, ax = plt.subplots()
  ax.plot(t, s, '-o')
  ax.set(xlabel='t', ylabel='sin(2πt)')
  ax.grid()
  fig.savefig("big-data/img/test.png")
  # plt.show()
  # RECORD TIME NEEDED TO PLOT
  tf = time.process_time() - t0
  # print(k, N, np.round(tf, 2))
  N_i.append(N); T_i.append(tf)
  # CLOSE FIGURES SO THEY DON'T PILE UP IN MEMORY
  plt.close("all")
#PLOT N VS TIME
fig, ax = plt.subplots()
ax.plot(N_i, T_i,'-o')
ax.set(xlabel='Number of points', ylabel='Time (s)')
ax.grid()
plt.show()

Some voodoo (or so it might appear)

library(tidyverse)
library(rasterly)
x <- rnorm(1e8) # 100 Million
y <- 3 + 2*x - 5*(x^2)
d <- data.frame('x' = x, 'y' = y)
start <- Sys.time()
p <- d %>% rasterly(mapping = aes(x, y)) %>% 
  rasterly_points()
p
times <- Sys.time() - start

This takes only 3.38 seconds!!!

Note we don’t have axes or the like, but we’ll fix that in a bit.

We’ll also be able to create interactive visualizations using this voodoo

Let’s see some more!

library(fst)
rides <- read.fst(file.path(currdir, 'rides.fst')) 

This is a data set of Uber trips in New York City between April and September, 2014 (Source)

This data has 4,534,327 rows and 3 columns

p <- rides %>% 
  rasterly(mapping= aes(x = Lon, y = Lat)) %>% 
  rasterly_points() 
p

This takes 0.26 seconds

This voodoo creates a compressed representation of the data (details coming up) but doesn’t, by itself, display it. We can use ggplot2 for that, if we like.

rides <- rides |> 
   mutate(hour = lubridate::hour(`Date/Time`))
ggRasterly(data = rides, 
           mapping = aes(x = Lat, y = Lon, color = hour),
           color = hourColors_map) + 
  labs(title = "New York Uber",
       subtitle = "Apr to Sept, 2014",
       caption = "Data from https://raw.githubusercontent.com/plotly/datasets/master")+
  hrbrthemes::theme_ipsum()

We can make this interactive!!!

plot_ly(rides,  x = ~Lon, y = ~Lat) |> 
   add_rasterly_heatmap()

What are the colors representing?

They represent the frequency of rides (on a log scale)

Zoom in

Introducing a Python toolset: HoloViz

HoloViz

HoloViz is a set of Python packages that makes data visualization easier, more accurate and more powerful

  • Panel for making apps and dashboards
  • hvPlot to quickly generate interactive plots from data
  • HoloViews to help make all your data easily visualizable
    • leverages matplotlib, bokeh and plotly
  • GeoViews extends HoloViews to geographic data
  • Datashader for rendering even the largest datasets
  • Lumen for building data-driven dashboards from YAML specifications
  • Param to create declarative user-configurable objects
  • Colorcet for perceptually uniform colormaps

hvPlot

Allowed inputs

  • Pandas: DataFrame, Series (columnar/tidy data)
  • Rapids cuDF: GPU DataFrame, Series
  • Dask: Distributed DataFrame and Series
  • XArray: Dataset, DataArray (labelled multi-dimensional array)
  • Streamz: DataFrame, Series (streaming columnar data)
  • Intake: Data Source (data catalogues)
  • GeoPandas: GeoDataFrame (geometry data)
  • NetworkX: Graph/Network data.
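A minimal hvPlot sketch with a pandas DataFrame (the data and column names are invented):

import numpy as np
import pandas as pd
import hvplot.pandas  # noqa: registers the .hvplot accessor on DataFrames

df = pd.DataFrame({
    "x": np.random.default_rng(0).normal(size=100_000),
    "y": np.random.default_rng(1).normal(size=100_000),
})
# one line gives an interactive Bokeh plot; rasterize=True hands the
# rendering off to Datashader, so it scales to millions of points
df.hvplot.scatter(x="x", y="y", rasterize=True)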

Issues with visualizing large data

Common pitfalls made worse by large data

  1. Overplotting
  2. Oversaturation
  3. Undersampling
  4. Undersaturation
  5. Underutilized range
  6. Nonuniform colormapping

Overplotting

Let’s look at an example

n <- 300; offset <- 0.5
d <- data.frame(
  x = c(rnorm(n, mean=offset ), rnorm(n, mean = -offset)),
  y = c(rnorm(n, mean=offset), rnorm(n, mean=-offset)),
  z = rep(c(1,-1), c(n,n))
)
theme_set(hrbrthemes::theme_ipsum())
a = ggplot(d[d$z==1,], aes(x,y))+geom_point(color='blue')
b = ggplot(d[d$z==-1,], aes(x,y)) + geom_point(color='red')
e = ggplot() + geom_point(data=d[d$z==1,], aes(x,y), color='blue') +
  geom_point(data = d[d$z==-1,], aes(x,y), color='red')
f = ggplot() + geom_point(data = d[d$z==-1,], aes(x,y), color='red') +
  geom_point(data=d[d$z==1,], aes(x,y), color='blue') 
cowplot::plot_grid(a,b,e,f, labels = c('A',"B",'C', 'D'))

Overplotting

(Figure: panels A–D produced by the code above)

Overplotting

This is a common issue when you have many points/curves/densities to plot.

You’re overlaying one data point on top of the other, and so you’re hiding data.

Another common example:

n <- 5000
d2 <- data.frame(x = rnorm(n), y = rnorm(n))
ggplot(d2, aes(x, y))+geom_point()

There is a greater mass of points towards the center, but all you see is one relatively uniform blob, because of overplotting

A solution to overplotting: Transparency

You can play with the transparency or opacity of your points to somewhat reduce the problems created by overplotting

ggplot(d2, aes(x,y))+geom_point(alpha = 0.1)

You get a better sense of where the data is and, more importantly, how much of it there is.

A solution to overplotting: Transparency

Transparency

There are still problems:

  • If alpha = 0.1, then 10 overlaid points will lead to full saturation, again making it harder to see individual points
    • Only the last 10 points plotted over each other influence the color (for alpha = 0.1)
  • A good alpha setting is conditional on the data; even small changes in the data can lead to oversaturation in regions at particular alpha levels

So, even for a small dataset, these issues come to the fore.
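The same idea in a minimal matplotlib sketch — identical data, two arbitrary alpha values, and neither is “right” for every sample size:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
xy = rng.normal(size=(50_000, 2))

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, alpha in zip(axes, [0.5, 0.02]):
    # same data, different opacity: too high saturates, too low hides sparse regions
    ax.scatter(xy[:, 0], xy[:, 1], s=2, alpha=alpha, color="black")
    ax.set_title(f"alpha = {alpha}")
plt.show()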

Sampling (and undersampling)

Another strategy with large data is to sample the data and visualize a representative subset.

You can also make the actual points smaller so that there is less chance of overlap.

Both strategies work sometimes

  • The “small dots” setting (size = 0.1, alpha = 1) has serious overplotting with 60K points
  • The “tiny dots” setting (size = 0.01, alpha = 0.1) makes the 600 points disappear
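In pandas, sampling a representative subset before plotting is one line (the fraction is a judgment call, and whatever you drop you will never see):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1_000_000),
                   "y": rng.normal(size=1_000_000)})

sub = df.sample(frac=0.01, random_state=0)  # a random 1% of rows
# alternatively, take every 100th row: df.iloc[::100]
sub.plot.scatter(x="x", y="y", s=1)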

Remember this one?

When we zoom in, we see the strategy we’ll explore from here on

Rasterization

Rasterization

Derived from the German Raster (Framework, schema), from the Latin rastrum, “scraper” – Wikipedia

Rasterization (or rasterisation) is the task of taking an image described in a vector graphics format (shapes) and converting it into a raster image (a series of pixels, dots or lines, which, when displayed together, create the image which was represented via shapes)

In data visualization we typically don’t just rasterize, but also summarize or aggregate the data.

Resolving overplotting and undersampling: heatmaps

  • These are 2D histograms visualized as heatmaps
  • Provides a fixed-size grid regardless of data size
  • The chosen grid determines the resolution you can see
    • Coarser heatmaps average out noise to reveal signal
    • Finer heatmaps can show more structure

  • A is too coarse
  • B is able to show the distributions (2 circular Gaussians)
  • C can approximate a scatter plot
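A minimal NumPy/matplotlib version of this idea — bin the points on fixed grids of increasing resolution and render the counts as an image (the grid sizes are arbitrary):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# two circular Gaussians, as in the example
x = np.concatenate([rng.normal(-1, 1, 50_000), rng.normal(1, 1, 50_000)])
y = np.concatenate([rng.normal(-1, 1, 50_000), rng.normal(1, 1, 50_000)])

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, bins in zip(axes, [10, 100, 400]):  # coarse -> fine grids
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    ax.imshow(counts.T, origin="lower", cmap="viridis")
    ax.set_title(f"{bins} x {bins} grid")
plt.show()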

Breaking the heatmap: undersaturation

Both heatmaps and alpha-based scatterplots can suffer from undersaturation

Plotting a composite of 5 Gaussian distributions (n = 10K, each) with different centers and spreads

  • In A, the Gaussian with the largest spread dominates
  • In B and C, we can start seeing the structure
    • In B, the medium-spread Gaussian dominates, despite identical sample sizes
    • C makes the grid size smaller & adds transparency, making the large-spread Gaussian disappear (undersaturation)

Breaking the heatmap: undersaturation

Using heatmaps isn’t much help.

  • Narrow-spread distributions lead to pixels with very high counts, which now dominate
  • Wide-spread distributions disappear

Breaking the heatmap: undersaturation

A possible solution: Add an offset, so that all points are represented and distinguished from background

Not using the full range

The distribution of the counts is strongly skewed, with some bins having very large values. When this is converted to a color or gray scale, the entire range is not fully utilized.

One way to deal with this is to make the distribution less skewed, for example by using a log transform (among many other choices).
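A sketch of the same fix on a synthetic count grid — one narrow and one wide Gaussian, with a log1p transform on the counts (one choice among many):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# a narrow and a wide Gaussian: a few bins get huge counts, most get tiny ones
x = np.concatenate([rng.normal(0, 0.05, 100_000), rng.normal(0, 1.0, 100_000)])
y = np.concatenate([rng.normal(0, 0.05, 100_000), rng.normal(0, 1.0, 100_000)])
counts, _, _ = np.histogram2d(x, y, bins=200)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(counts.T, origin="lower")            # raw counts: the wide Gaussian vanishes
axes[1].imshow(np.log1p(counts).T, origin="lower")  # log1p(counts): the full range is used
plt.show()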

Not using the full range

We can now see the structure of all 5 Gaussians in B and C, and their relative spreads in C.

Further improvements: histogram equalization

Instead of determining colors from the binned counts directly, one can select the color based on the percentile of each count. If you have 100 gray values, you map the lowest 1% of the counts to the first color, the largest 1% to the last color, and so on.

This then creates a rank-order plot, preserving the order but not the magnitudes of the counts
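A rough sketch of that mapping — replace each nonzero count by its percentile rank (datashader’s tf.shade(agg, how='eq_hist') does a more careful version of this):

import numpy as np

def eq_hist(counts):
    """Map nonzero counts to percentile ranks (a sketch; ties broken arbitrarily)."""
    out = np.zeros_like(counts, dtype=float)
    nz = counts > 0
    ranks = counts[nz].argsort().argsort()  # rank of each nonzero bin
    out[nz] = (ranks + 1) / ranks.size      # percentiles in (0, 1]
    return out

# e.g. with `counts` from the histogram2d sketch above:
# plt.imshow(eq_hist(counts).T, origin="lower")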

Now things pop

Note that you can’t reconstruct the original data from this anymore.

One more piece: colormaps

One problem with many colormaps is that they are perceptually non-uniform, so that you can’t visually distinguish colors over a large range of values. A prototypical example is the jet colormap, once the default in matplotlib and Matlab.

Figure B allows more detail to be shown than figure A, which uses a non-uniform colormap.

There are several perceptually uniform colormaps available

  • viridis and friends in R and Python
  • the Python colorcet package and the R cetcolor package
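A quick way to see the difference — the same gradient rendered with jet and with viridis (colorcet’s colormaps can be swapped in the same way after importing colorcet):

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(0, 1, 256).reshape(1, -1).repeat(30, axis=0)  # a simple ramp
fig, axes = plt.subplots(2, 1, figsize=(6, 3))
axes[0].imshow(z, cmap="jet", aspect="auto")
axes[0].set_title("jet (perceptually non-uniform)")
axes[1].imshow(z, cmap="viridis", aspect="auto")
axes[1].set_title("viridis (perceptually uniform)")
for ax in axes:
    ax.set_axis_off()
plt.tight_layout()
plt.show()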

Summarizing the pitfalls

datashader

Introduction

Datashader is a Python package for analyzing and visualizing large datasets

  • The basic strategy is to use the bin-summarize paradigm to rasterize or aggregate datasets into regular grids
  • These grids are essentially a data reduction/compression method, reducing millions of points into a 400x400 grid, say
  • Good default choices that avoid the pitfalls described earlier allow quick visual analysis of large datasets
  • datashader produces an image that can then be embedded within other plotting tools like matplotlib, plotly, and bokeh

Taxi trips in New York City, 2014. 12 million pickup/dropoff locations with passenger counts

Introduction

datashader basically runs an analytic pipeline to transform data into a rasterized image.

The ultimate deliverable for the datashader package is the image, which can then be plotted by embedding it into a plotting system.

We will also look at rasterly, which essentially does the same thing in the R ecosystem

Example: Flight delays (Kaggle)

This dataset comprises 5.8 million rows

import plotly.express as px
import pandas as pd
import numpy as np
import datashader as ds
df = pd.read_parquet('https://raw.githubusercontent.com/plotly/datasets/master/2015_flights.parquet')
df.shape
(5819079, 4)

Example: Flight delays (Kaggle)

Interest is in whether departure delays depend on when flights are scheduled to leave.

df_sub = df[::200] # take every 200th row, so ~29K rows
fig = px.scatter(data_frame = df_sub, x = "SCHEDULED_DEPARTURE", y = "DEPARTURE_DELAY")
fig.show()

Example: Flight delays (Kaggle)

cvs = ds.Canvas(plot_width=100, plot_height=100)
agg = cvs.points(df, 'SCHEDULED_DEPARTURE', 'DEPARTURE_DELAY')
zero_mask = agg.values == 0
agg.values = np.log10(agg.values, where=np.logical_not(zero_mask))
agg.values[zero_mask] = np.nan

This is the basic datashader process

  • Create a canvas (which will define the grid)
  • Aggregate within each pixel (default is count(), but others possible)
  • If a pixel has 0 counts (or no data), make it a missing value (np.nan) and mask it

Optionally, we have also

  • log-transformed the counts so that we make better use of the dynamic range of values

There are other aggregates that you can plot using datashader. Several examples are provided here, including maximum and minimum values for data represented in each pixel.
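For instance, swapping the default count() for a different reduction changes what each pixel summarizes. A sketch, reusing the df loaded in the flights example above:

import datashader as ds

cvs = ds.Canvas(plot_width=100, plot_height=100)
# mean departure delay in each pixel, rather than a count of flights
agg_mean = cvs.points(df, 'SCHEDULED_DEPARTURE', 'DEPARTURE_DELAY',
                      agg=ds.mean('DEPARTURE_DELAY'))
# worst (maximum) delay observed in each pixel
agg_max = cvs.points(df, 'SCHEDULED_DEPARTURE', 'DEPARTURE_DELAY',
                     agg=ds.max('DEPARTURE_DELAY'))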

Example: Flight delays (Kaggle)

fig = px.imshow(agg, origin='lower', labels={'color':'Log10(count)'})
fig.show()

All datashader is doing is reducing this large dataset to an aggregate over a smaller, manageable grid of points, which is stored as a matrix (technically an xarray DataArray)

This matrix is then rendered as an image
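datashader can also do the final rendering itself: its transfer_functions module shades the aggregate grid into an image, applying the count transforms discussed earlier (eq_hist is the default). A sketch, again with df from above:

import datashader as ds
import datashader.transfer_functions as tf

cvs = ds.Canvas(plot_width=400, plot_height=400)
agg = cvs.points(df, 'SCHEDULED_DEPARTURE', 'DEPARTURE_DELAY')
# shade the counts with a log transform (how="eq_hist" would rank-order instead)
img = tf.shade(agg, cmap=["lightblue", "darkblue"], how="log")
img  # an image you can save or embed in matplotlib/plotly/bokeh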

Example: Flight delays (Kaggle)

fig.update_traces(hoverongaps=False)
fig.update_layout(coloraxis_colorbar=dict(title='Count', tickprefix='1.e'))
fig.show()

This is now plotly magic that adds a legend and spiffies things up

rasterly

rasterly

rasterly does for R what datashader does for Python.

(well, not exactly, but it’s aspirational)

Let’s look at the same Kaggle flights data

rasterly

pacman::p_load(tidyverse, arrow, rasterly)
df <- read_parquet(here::here('slides/big-data/2015_flights.parquet')) %>% 
  janitor::clean_names() %>% 
  filter(complete.cases(.))
dim(df)
[1] 5714008       4

rasterly

We will first make a static visualization using ggplot2

theme_set(hrbrthemes::theme_ipsum())
ggplot(
   df %>% filter(row_number() %% 200 ==1), 
   aes(x=scheduled_departure, y = departure_delay)
)+
   geom_point()

rasterly

We will now put rasterly to use

ggRasterly(df, aes(x=scheduled_departure, y = departure_delay), 
           color=viridis_map)

rasterly

We can also make it interactive using plotly

plt <- plotRasterly(df, 
  aes(x=scheduled_departure, 
      y = departure_delay), 
  color=viridis_map)

Leveraging modern tools for big data visualization

Data tools

You’ll see many of these in DSAN 6000

Need: Scalable storage and processing of big data

Two tools:

  1. The parquet format for efficient columnar storage, compression and parallelization.

    • Standard in the Hadoop ecosystem
    • Can be read using functions from the Apache Arrow project (arrow in R and pyarrow in Python)
  2. The DuckDB in-process analytical database

    • Uses SQL for processing
    • Deep integration with various tools (including Python, R, tidyverse, WASM)
    • Super-fast

Leveraging these tools for visualization

  • The main approach to big data visualization is the bin-summarize method to create raster images showing data relationships.
  • There is also the need for interaction and cross-filters, which require quick data processing, re-evaluation and re-rasterization

Using parquet for data storage, streaming it into DuckDB and re-processing the data makes “real-time” interactions possible
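A sketch of that bin-summarize step done directly in DuckDB from Python. The Parquet path is a hypothetical local copy of the flights file, and the bin widths (and the assumption that departures are in hours and delays in minutes) are mine; only the small aggregated grid ever reaches Python:

import duckdb

agg = duckdb.sql("""
    SELECT
        ROUND(SCHEDULED_DEPARTURE, 1)       AS dep_bin,    -- ~0.1-hour bins (assumes hours)
        ROUND(DEPARTURE_DELAY / 10.0) * 10  AS delay_bin,  -- 10-minute bins (assumes minutes)
        COUNT(*)                            AS n
    FROM '2015_flights.parquet'             -- hypothetical local path
    GROUP BY dep_bin, delay_bin
""").df()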

Mosaic is an extensible framework for linking databases and interactive views, developed recently by the Heer lab at University of Washington in collaboration with the Perer/Moritz lab at Carnegie Mellon.

  • Interface components (visualizations, tables, widgets, etc) publish data needs as queries
  • Queries managed by a central coordinator
  • Queries issued to a back-end data source (DuckDB by default)

Mosaic comes with several front end packages for developers:

  • vgplot: A grammar of graphics-based package for plotting and building interactive visualizations and dashboards. It renders visualizations in SVG output using Observable Plot.
  • mosaic-spec: Declarative specification of Mosaic visualizations using YAML or JSON (much like Vega and Vega-Lite)
  • duckdb-server: A Python-based server running a local DuckDB instance, supporting queries over Web Sockets or HTTP, returning data in either Arrow or JSON formats
  • mosaic-widget: A Jupyter widget for Mosaic to render vgplot visualizations in Jupyter notebook cells, processing data with DuckDB in the Python kernel

Looking at cross-filtering of 50,000 sample points

For this visualization we’re using Observable Plot, and running Mosaic with a DuckDB backend running in WebAssembly

Code
viewof rw = {
  await vg.coordinator().exec(`CREATE TABLE IF NOT EXISTS randomWalk AS
    SELECT *
    FROM 'https://uwdata.github.io/mosaic-datasets/data/random-walk.parquet'`);

  const brush = vg.Selection.intersect();

  const plot1 = vg.plot(
    vg.areaY(
      vg.from('randomWalk'),
      {x: "t", y: "v", fill: "steelblue"}
    ),
    vg.intervalX({as: brush}),
    vg.width(680),
    vg.height(200)
  );

  const plot2 = vg.plot(
    vg.areaY(
      vg.from('randomWalk', {filterBy: brush}),
      {x: "t", y: "v", fill: "steelblue"}
    ),
    vg.yDomain(vg.Fixed),
    vg.width(680),
    vg.height(200)
  );

  return vg.vconcat(
    plot1,
    plot2
  );
}

Cross-filtering histograms derived from 10 million flights: arrival delays, departure times and distance flown

In this example, we read the data as a Parquet file residing in GitHub, and create 3 linked histograms that allow cross-filtering.

Code
viewof flights = {
  // load flights data from external parquet file
  await vg.coordinator().exec(`CREATE TABLE IF NOT EXISTS flights10m AS
    SELECT
      GREATEST(-60, LEAST(ARR_DELAY, 180))::DOUBLE AS delay,
      DISTANCE AS distance,
      DEP_TIME AS time
    FROM 'https://uwdata.github.io/mosaic-datasets/data/flights-10m.parquet'`);

  // create a selection with crossfilter resolution
  const brush = vg.Selection.crossfilter();

  // helper method to generate a binned plot filtered by brush
  // a plot contains a rectY mark for a histogram, as well as
  // an intervalX interactor to populate the brush selection
  const makePlot = column => vg.plot(
    vg.rectY(
      vg.from("flights10m", { filterBy: brush }), // data set and filter selection
      { x: vg.bin(column), y: vg.count(), fill: "steelblue", inset: 0.5 }
    ),
    vg.intervalX({ as: brush }), // create an interval selection brush
    vg.xDomain(vg.Fixed), // don't change the x-axis domain across updates
    vg.marginLeft(75),
    vg.width(600),
    vg.height(130)
  );

  // generate dashboard with three linked histograms
  return vg.vconcat(
    makePlot("delay"),
    makePlot("time"),
    makePlot("distance")
  );
}

NYC Taxi Rides

Pickup and dropoff points for 1M NYC taxi rides on Jan 1-3, 2010 (this may take a bit of time to load)

Filter on the bar plot and see updates in the maps

Code
viewof taxi = {
  await vg.coordinator().exec([
    vg.loadExtension("spatial"),
    vg.loadParquet("rides", "https://uwdata.github.io/mosaic-datasets/data/nyc-rides-2010.parquet", {
      select: [
        "pickup_datetime::TIMESTAMP AS datetime",
        "ST_Transform(ST_Point(pickup_latitude, pickup_longitude), 'EPSG:4326', 'ESRI:102718') AS pick",
        "ST_Transform(ST_Point(dropoff_latitude, dropoff_longitude), 'EPSG:4326', 'ESRI:102718') AS drop"
      ]
    }),
    `CREATE TEMP TABLE IF NOT EXISTS trips AS SELECT
      (HOUR(datetime) + MINUTE(datetime)/60) AS time,
      ST_X(pick) AS px, ST_Y(pick) AS py,
      ST_X(drop) AS dx, ST_Y(drop) AS dy
    FROM rides`
  ]);

const filter = vg.Selection.crossfilter();

const plot1 = vg.plot(
      vg.raster(
        vg.from("trips", {filterBy: filter}),
        {x: "px", y: "py", bandwidth: 0}
      ),
      vg.intervalXY({as: filter}),
      vg.text(
        [{label: "Taxi Pickups"}],
        {
          dx: 10,
          dy: 10,
          text: "label",
          fill: "black",
          fontSize: "1.2em",
          frameAnchor: "top-left"
        }
      ),
      vg.width(335),
      vg.height(550),
      vg.margin(0),
      vg.xAxis(null),
      vg.yAxis(null),
      vg.xDomain([975000, 1005000]),
      vg.yDomain([190000, 240000]),
      vg.colorScale("symlog"),
      vg.colorScheme("blues")
    );
const plot2 = vg.plot(
      vg.raster(
        vg.from("trips", {filterBy: filter}),
        {x: "dx", y: "dy", bandwidth: 0}
      ),
      vg.intervalXY({as: filter}),
      vg.text(
        [{label: "Taxi Dropoffs"}],
        {
          dx: 10,
          dy: 10,
          text: "label",
          fill: "black",
          fontSize: "1.2em",
          frameAnchor: "top-left"
        }
      ),
      vg.width(335),
      vg.height(550),
      vg.margin(0),
      vg.xAxis(null),
      vg.yAxis(null),
      vg.xDomain([975000, 1005000]),
      vg.yDomain([190000, 240000]),
      vg.colorScale("symlog"),
      vg.colorScheme("oranges")
    );

const plot3 =   vg.plot(
    vg.rectY(
      vg.from("trips"),
      {x: vg.bin("time"), y: vg.count(), fill: "steelblue", inset: 0.5}
    ),
    vg.intervalX({as: filter}),
    vg.yTickFormat("s"),
    vg.xLabel("Pickup Hour →"),
    vg.width(680),
    vg.height(100)
  );
 
 return vg.vconcat(
    vg.hconcat(
        plot1, vg.hspace(10), plot2
    ),
    vg.vspace(10),
    plot3
);
}