Lecture 1

Course intro, expectations, why we visualize, gallery of bad visualizations, designing for an audience, visualization best-practices

Abhijit Dasgupta, Jeff Jacobs, Anderson Monken, and Marck Vaisman

Georgetown University

Spring 2024

Agenda and Goals for Today

Lecture

Course introduction and expectations
Why we visualize
Thinking of our audience
Start thinking of best practices
What NOT to do (one of many)

Lab

Data wrangling for visualization
ggplot, matplotlib and seaborn review

Your instructors

Marck Vaisman

AI & ML Cloud Architect and Data Scientist at Microsoft
Teaching at Georgetown since 2016 & GWU since 2015
Co-Founder of DataCommunityDC
R Fanatic

Fun Facts

I love music and try to play music at the beginning of class, typically EDM. Other genres I love are Latin, Bluegrass and Chill
I speak fluent Spanish, I grew up in Venezuela
Love beer & bourbon
Goofball
Westie owner
I can speak like Donald Duck

Anderson Monken

Data Science Manager at Federal Reserve Board of Governors
- Team uses big data, web development, software development, machine learning, and AI
- Technology and cloud initiatives
- Research focus in international trade and economics
Adjunct Professor since 2022
DSAN Program Graduate

Fun Facts

Amateur car mechanic on several old BMWs
Can solve a Rubik’s Cube in under a minute
Canoed over 200 miles in Canada

Abhijit Dasgupta

Data Science Associate Director at AstraZeneca supporting Oncology R&D
- bioinformatics, biomarkers, clinical studies
- autoencoders, survival analysis, signal processing
Adjunct Professor at Georgetown since 2020
R and reproducible research evangelist
- Python is cool too!!
Co-founder of Statistical Programming DC (with Marck Vaisman)

Fun Facts

I’m a 4th degree black belt in Aikido,
- over 30 years experience providing flyer miles
Exploring global whiskey, currently on Japan
Active in community theater, mainly behind but sometimes on-stage.

Jeff Jacobs

Assistant Teaching Professor since 2023
Finished up PhD in Political Economy from Columbia University in NYC in 2022
Research papers: NLP for empirical studies of labor economics, game theory for normative models of justice/domination/exploitation in labor markets
Dissertation: Using NLP to study history of political thought as a series of rhetorical “wars of words”, from 18th century to present

Fun Facts

Born and raised ~5 minutes east of Georgetown campus!
Passion project: Teaching CS + Design Thinking classes each summer in refugee camps in Gaza, the West Bank, and one time in northern Syria
Favorite hobbies: Making moody computer music, reading books that I don’t have to read to procrastinate reading books that I do have to read
Lover of all animals but especially my cat Biko

About the Course

Course description

This course explores the art and science of data visualization from the ground up.

We discuss the power of visualizations to communicate data in an engaging, informative, and accessible way. Drawing from statistics, graphic design, and computer and information science techniques, you will learn to think critically about data, choose optimal static and dynamic visualization methods, recognize dishonest visualization techniques, and distinguish between visualizations used for data exploration and publication.

We cover popular visualization libraries in R and Python, front-end web development tools for digital publishing with D3.js and other JavaScript libraries, and commercial applications like Tableau.

By the end of the course, you will have a deep understanding of data visualization techniques, design principles for generating publication-quality graphics, manipulating data for analysis and visualization, and designing compelling visualizations for different purposes and audiences.

Learning objectives

Increase your data visualization vocabulary
Understand what comes before and after creating visualizations
Think critically about data
Distinguish between using visualizations for data exploration or presentation
Manipulate and arrange data for the purposes of analysis and visualization
Design effective visualizations for different purposes and audiences
Apply a set of rules to create highly effective and engaging data visualizations
Understand the role of data visualization within data science and for solving problems
Build dynamic visualizations designed for digital and interactive consumption
Use an array of tools including R, Python, Tableau, JavaScript and web tools (HTML, CSS, etc.) to execute all of the above. This includes particular packages and libraries like ggplot2, matplotlib/seaborn, plotly, Observable Plot, Vega/Vega-Lite/Altair, and others.

Our roadmap

Data types

Quantitative
Qualitative or discrete
Dates and times
Text
Maps, geographic, location based
Relationships (graphs and networks)
Combinations of all

Getting there

Static visualizations
Exploratory data analysis
Interactive visualizations
Storytelling and visual narratives
Dashboarding

Software

We will mainly use two scripting languages in this course:

R
- In particular, we will make heavy use of the tidyverse group of packages and its philosophy
Python
- packages in the PyData ecosystem, primarily pandas, matplotlib, seaborn.

Given we are in 2024:

D3.js, primarily via higher-level JavaScript libraries like plotly and Observable Plot, as well as R and Python wrappers

And for those who want point-and-click (but how do you make it reproducible?):

Tableau, a popular data visualization software.

Expectations

We have high expectations and strong opinions!

This is a professional, graduate level program.
You must use excellent design practices and make choices that support them for a clean and professional presentation.
Your spelling and grammar must be accurate.
All work should look professional, and have the requisite titles, labels, source references, annotations and other textual elements in regular English.
- Raw variable names and dataset names are not allowed as labels. Titles and legends must show context.
- Visualizations should also follow a visual theme you have created, not one of the default styles found in ggplot2, ggthemes, hrbrthemes, matplotlib, seaborn or any other package we have covered in the course.
- Visualizations should also be understandable on their own. Brief legends and annotations are allowed.

You will develop your own visual theme. No default settings are permitted, period. Default settings are not usually the best. Make your visualization both beautiful and useful, and your own.

YOUR WORK MUST BE PUBLICATION READY!

It ain’t that easy

Data visualization is neither simple nor easy. It requires thought, creativity and understanding both the data and the topic/aspect/answer you are trying to express through the visualization. Give yourself ample time to think, explore and experiment. It is very easy for us to determine when you didn’t give yourself enough time and “phoned it in”.

Starting out is easy, especially with the current software tools.
However, getting to presentation/publication quality takes a lot of effort and/or tweaking.

Let’s avoid this

Bookmark these links!

Course website: https://gu-dsan.github.io/5200-spring-2024/
Syllabus, especially course policies
GitHub Organization for your deliverables: https://github.com/gu-dsan5200/
Slack Workspace: https://dsan5200spring2024.slack.com
Instructors email: dsan5200-instructors@georgetown.edu
Canvas: https://georgetown.instructure.com/courses/182993

These will also be pinned on the Slack main channel

We’re gonna have an awesome time!

Brief history of data visualization

Where did it start?

The original “data visualizers” were map makers, astronomers and navigators

The scientific revolution

The industrial revolution

1854: John Snow cholera clusters in London

Geospatial mapping of homes with cholera cases
One of the first cases of spatial epidemiology
Visually identified clusters of incident cases around a particular water source

Industrial revolution

1856: Florence Nightingale: Causes of mortality in the Army in the East

Convince generals that most deaths were preventable and not directly from war wounds

War time

1869: Minard: Flow of Napoleon’s army during his war with Russia

The information age

1970s: CAD/CAM
1977: Princeton University statistics professor John Tukey published “Exploratory Data Analysis”, which established exploratory data analysis (EDA) through visualization.
1980s: Scientific and business visualizations (Harvard Graphics)
1983: Edward Tufte published “The Visual Display of Quantitative Information”, which showed effective visualization methods.
1984: Apple Computer introduced the first popular and affordable computer that focused on graphics (GUI) as a mode of interaction and display. This was huge and persists today. (Modified from version at IBM)
1990s: Excel, PowerPoint, R
1999: The term “information visualization” was first used as a name for the field in the book “Readings in Information Visualization: Using Vision to Think” by Card, Mackinlay, and Shneiderman
Data visualization is now so ubiquitous that it’s hard to imagine a time without it

Why is data visualization important?

Humans are visual learners

Most people are inherently visual
We have a highly developed visual cortex
Visualizations take advantage of our innate capability to understand visual patterns quickly and intuitively
Visualizations also take advantage of our ability to detect weird patterns or aberrations

When do we use visualization?

Record information

We can compress a huge amount of information into a condensed visual representation (e.g., blueprints, photographs, seismographs, maps)

Analyze and explore

Exploratory visualization

Helps to develop and assess hypotheses
Identifying a good model for the data
Find patterns and spot trends, discover errors

Communicate

Explanatory visualization

Tell stories, share, inform, persuade
Collaborate and revise

Why do we want to visualize data?

To be able to show patterns and relationships vividly
To provide insight and knowledge
- Not just throw together some representation of the data
- We’re translating data into a visual medium to communicate some concept or idea to our audience
Take advantage of our natural visual strengths and pattern recognition ability

Typical summaries

Averages
Variances
Correlations
Regression lines

These are examples of data compression, where we’re “squeezing” the data into a few numbers
- Unfortunately these are typically lossy compression methods
- You can’t recover the data from these, because several data sets can give you the same summaries

Summaries don’t differentiate

Anscombe (1973) created these toy examples
Averages of x and y are the same
Correlations between x and y are the same
Relationships are VERY different
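The claim that summaries are lossy can be checked numerically. The following is a minimal Python sketch using only the standard library; it types in Anscombe's published values by hand and computes the mean, variance and Pearson correlation for each of the four datasets. The helper name `pearson` and the choice to round to two decimals are illustrative assumptions, not part of the lecture.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973): four small x/y datasets.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(x, y):
    """Pearson correlation coefficient, written out by hand (illustrative helper)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# (mean of x, mean of y, sample variance of x, correlation), rounded to 2 decimals
summaries = {
    name: (round(mean(x), 2), round(mean(y), 2), round(variance(x), 2), round(pearson(x, y), 2))
    for name, (x, y) in quartet.items()
}

for name, summary in summaries.items():
    print(name, summary)
```

All four rows print the same rounded summary, yet scatterplots of the four datasets look nothing alike. That is the sense in which the summaries "compress" the data and lose its shape.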

A modern take: the datasaurus

Regardless of pattern, the points have the same marginal means and variances and the same Pearson correlation!!

Our choice of visualization determines what is revealed

Three varying 1D distributions of data, all with the same boxplot representation.

Note that as the data changes, both the histogram and strip plot reflect the changes, but the boxplot doesn’t
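A small Python sketch (standard library only) makes the same point with numbers. A boxplot draws the five-number summary, so two samples that share it produce the same box. The two samples below are hypothetical values chosen for illustration, and the quartile convention (`method="inclusive"`) is an assumption; plotting libraries may use slightly different quartile rules.

```python
from collections import Counter
from statistics import quantiles


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), the values a boxplot is drawn from."""
    ordered = sorted(data)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, median, q3, ordered[-1])


# Hypothetical samples: one evenly spread, one clumped at 3 and 7.
spread = [0, 2, 4, 5, 6, 8, 10]
clumped = [0, 3, 3, 5, 7, 7, 10]

for name, sample in [("spread", spread), ("clumped", clumped)]:
    counts = Counter(sample)
    # One digit per integer value 0..10: how many points sit at that value.
    strip = "".join(str(counts[v]) for v in range(11))
    print(f"{name:8s} summary={five_number_summary(sample)} counts={strip}")
```

The two summaries match, while the per-value counts (a crude stand-in for a strip plot) differ.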

Our choice of visualization determines what is revealed

Seven distributions of data, shown as raw data points (or strip plots), as box plots, and as violin plots

Last 3 figs: Matejka & Fitzmaurice (2017)

You see something similar here. The violin plot can show distributional changes while the boxplot can’t

Is this really a problem for using boxplots?

Not really. Most of the time we’re trying to show differences in location (mean, median) rather than differences in distribution. The boxplot does perfectly well showing those differences.

Simpson’s paradox

Code

# tidyverse supplies bind_rows() and ggplot2; MASS supplies mvrnorm()
suppressPackageStartupMessages(library(tidyverse, quietly = TRUE, warn.conflicts = FALSE))
suppressPackageStartupMessages(library(MASS, quietly = TRUE, warn.conflicts = FALSE))
set.seed(5)

# Every group shares the same negative within-group covariance
Sigma <- matrix(c(2, -0.7, -0.7, 2), ncol = 2)

# Three groups whose centers rise along the diagonal
d <- list()
d[['Group 1']] <- as.data.frame(mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma))
d[['Group 2']] <- as.data.frame(mvrnorm(n = 1000, mu = c(3, 3), Sigma = Sigma))
d[['Group 3']] <- as.data.frame(mvrnorm(n = 1000, mu = c(6, 6), Sigma = Sigma))
D <- bind_rows(d, .id = 'Group')

theme_set(theme_bw())
# Pooled data: a single linear fit shows a positive trend
ggplot(D, aes(V1, V2)) + geom_point() + geom_smooth(method = 'lm', color = 'red')
# Colored by group: each within-group fit shows a negative trend
ggplot(D, aes(V1, V2)) + geom_point(aes(color = Group), show.legend = FALSE) +
  geom_smooth(aes(color = Group), se = FALSE, method = 'lm')

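Simpson’s paradox: a trend that is present within each group reverses or disappears when the groups are combined. In this example, each group shows a negative relationship between V1 and V2, while the pooled data shows a positive one.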
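The reversal can also be seen without random draws. The sketch below is a deterministic Python analogue of the simulation, not a port of it: the group centers (0, 3, 6) match the simulation, but the within-group points and the slope of -0.5 are made-up values chosen so the arithmetic is easy to follow. The helper `ols_slope` is a hand-written least-squares slope.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data: three groups centered on the diagonal,
# each with a downward slope of -0.5 around its own center.
offsets = [-2, -1, 0, 1, 2]
groups = {}
for name, center in [("Group 1", 0), ("Group 2", 3), ("Group 3", 6)]:
    x = [center + d for d in offsets]
    y = [center - 0.5 * d for d in offsets]
    groups[name] = (x, y)

within = {name: ols_slope(x, y) for name, (x, y) in groups.items()}

pooled_x = [v for x, _ in groups.values() for v in x]
pooled_y = [v for _, y in groups.values() for v in y]
pooled = ols_slope(pooled_x, pooled_y)

print("within-group slopes:", within)
print("pooled slope:", pooled)
```

Each within-group slope is negative while the pooled slope is positive, because the differences between group centers dominate the pooled fit.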

We visualize to understand our models and their performance

We visualize to tell a story

Note: We will re-create this visualization later in the class.

We visualize to help with communicating our work to our stakeholders!

Data alone is not insight

Humans can tell a better story than data can by itself
Numbers rarely speak for themselves and need context

Data scientists with storytelling skills have greater business impact:

Influencing for impact comes down to conveying a compelling narrative around data and what it means
Data scientists who develop this skill typically have an edge in getting their work noticed and acted upon

In your work as data scientists, in addition to doing modeling and machine learning work, you will be responsible (either individually or as part of a team) for providing the following as part of a project:

Findings: what does the data say?
Conclusions: what is your interpretation of the data?
Recommendations: what can be done as a result of the findings and conclusions?

What aspects of this does data visualization support?

Case study

You are a newly hired data scientist with visualization superpowers, expert coding skills in R, Python, Excel (Visual Basic), SAS, SPSS, Tableau, PowerBI
You are given a dataset and you begin to explore it
You kind of know what you’re looking for, but you don’t know what you’re going to find yet
You work with your bag of tools through the available resources

You create a SUPER AWESOME visualization

And the reaction you get is…

Which makes you feel like

So how can we do better?

The things not to do

Make your audience perform mental gymnastics
Use colors that are indistinguishable for a segment of your audience
Do a poor job of designing your visualization
- A poor choice of the visualization type
- No titles, labels, legends or annotations
- Cryptic variable names instead of English/French/Swedish/Chinese
- Chartjunk!!
- Increasingly, a visualization that is not mobile-friendly
… and the list can go on, and on, and on…

“The audience’s reaction should never be ‘what am I looking at?’ If the visualization requires more than a sentence to explain, it isn’t good.” (Ryan Rosario)

How do we want to visualize

Begin with the audience in mind (to bastardize Stephen Covey)
Design with intention
- Know your story and tell it

There is a mental model for how to create a data visualization. It’s called the Grammar of Graphics.

If you know ggplot2, you already know what this is.

Otherwise, stay tuned.

Many software packages incorporate the Grammar of Graphics. Of course, ggplot2 is the grand-daddy of them, but seaborn.objects, plotly, altair, vega and others also implement it.

The things to get right

Visual encoding of data, or how different pieces of data are represented on the canvas
Visual integrity and honesty
Knowing your audience
Making good design choices to make visualizations clearer and more impactful
… and for us, reproducibility

Who cares, right?

You MUST care about your different audiences

Readers who land on your visualization may not have had the same luxury of developing and answering questions that you did
Your audience wants to know the story, conclusions, and/or results; they don’t want to analyze the data - that’s your job!

Visualization for Analysis

visualizations for you and your team
team and audience know context
tool for understanding datasets
iterate quickly to develop insights
rough drafts
can make changes later

Visualization for Presentation

audience external to you and team
content is likely new and audience has no context
designed to communicate useful information
takes significantly more time
publication ready

Visualization is an iterative process

Practice, practice, practice

So how do we make a great visualization?

Data <—> Grammar

Data visualization is as valuable to anyone working with data as grammar is to someone working with words
Just as you should not write an essay without proper grammar, you should not create a graph without first mastering data visualization best practices

Data visualization is part art, part science

Jules Morgan (https://ime.springerhealthcare.com/art-vs-science-in-a-global-pandemic/)

There are several sets of principles for good visualization design

Nathan Yau

Adjustment rules

Explain the encodings
Provide context
Focus on readability
Develop aesthetics

7 basic rules for making charts and graphs

Check the data
Explain encodings
Label axes
Include units
Keep your geometry in check
Include your sources
Consider your audience

Ed Tufte

Integrity principles

Show data variation, not design variation
Do not use graphics to quote data out of context
Use clear, detailed, thorough labeling
Representation of numbers should be directly proportional to numerical quantities
Don’t use more dimensions than the data require

Design principles

Show comparisons
Show causality
Use multivariate data
Completely integrate modes (like text, images, numbers)
Establish credibility
Focus on content

Noah Iliinsky

Four pillars of visualization

A successful visualization:

Has clear purpose (why this visualization)
Includes (only) the relevant content (what to visualize)
Uses appropriate structure (how to visualize it)
Has useful formatting (everything else)

Nathan’s seven rules details

1. Check the data

This is obvious: if your data is weak, your chart is weak
Start with simple graphs to see if there are any outliers.
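A "simple graph" can be as small as a text histogram. The sketch below, in standard-library Python, bins a list of values and prints one bar per bin, including empty bins so gaps stay visible. The function name `text_histogram` and the readings are hypothetical, made up for illustration.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one text bar per bin, from the lowest to the highest occupied bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        # Counter returns 0 for empty bins, so gaps show up as empty bars.
        lines.append(f"{start:>4} | {'#' * counts[start]}")
    return lines


# Hypothetical integer readings with one suspicious value.
readings = [12, 15, 14, 13, 16, 15, 14, 98]

lines = text_histogram(readings, bin_width=10)
print("\n".join(lines))
```

The lone bar far from the rest is the cue to go back and check that record before building anything more polished.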

2. Explain encodings

Don’t assume the reader knows what everything means; graphs should have captions!
Provide a legend
Label the marker shapes
Explain color scale

Nathan’s seven rules details (cont’d)

3. Label axes

Axes without labels or explanation are just decoration
Describe the scale (incremental, exponential, logarithmic?)
If possible, have axis values start at zero (don’t omit the baseline)

4. Include units:

Numbers without units are meaningless
Remove the guesswork

Nathan’s seven rules details (cont’d)

5. Keep your geometry in check

This is something that is immediately noticeable
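Don’t use area to compare two quantities unless they are areas. Area grows with the square of length: doubling a circle’s radius quadruples its area.
Tip: size circles and other 2D shapes so that their area, not their radius or side length, is proportional to the value. Bar charts are the exception, because bars are compared by length.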
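The sizing tip comes down to a square root. The sketch below, in standard-library Python, computes a circle radius so that area is proportional to the value. The function name `radius_for_area` and its parameters are hypothetical, chosen for this illustration.

```python
import math


def radius_for_area(value, max_value, max_radius):
    """Radius that makes circle AREA proportional to value.

    The largest value gets max_radius; every other radius shrinks with the
    square root of the value ratio, so areas shrink with the ratio itself.
    """
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * math.sqrt(value / max_value)


# Hypothetical values: B is one quarter of A.
for name, value in {"A": 100, "B": 25}.items():
    r = radius_for_area(value, max_value=100, max_radius=20)
    area_share = (r / 20) ** 2  # area relative to the largest circle
    print(f"{name}: radius={r:.1f}, relative area={area_share:.2f}")
```

With this scaling, B gets half the radius of A and one quarter of its area, matching the data. Scaling the radius directly (radius 5 for B) would give B one sixteenth of A's area and overstate the difference.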

6. Include your sources

This is another obvious one: always include the source of your data!
This makes your graphic more reputable and allows others to dig deeper

Nathan’s seven rules details (cont’d)

7. Consider your audience

What purpose do your charts have and who are they for?
Avoid quirky fonts
Make good design choices

A mental model: The Grammar of Graphics

Historical context

1994: William S. Cleveland, The Elements of Graphing Data, lists the “basic elements of graph construction” as scales, captions, plotting symbols, reference lines, keys, labels, panels, and tick marks.
1999: L. Wilkinson’s book The Grammar of Graphics defines ‘components of a graphic’:
- Structured framework utilizing a layered approach to describe and construct visualizations
2009: ggplot2: Elegant Graphics for Data Analysis, Book by Hadley Wickham.

What is it?

The Grammar of Graphics is a conceptual framework for thinking about graphics
It provides a hierarchy of elements to deconstruct and understand figure design
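One way to see such a decomposition is a declarative chart specification. The sketch below builds a minimal Vega-Lite scatterplot spec as a plain Python dictionary, using only the standard library. The data values, field names and titles are hypothetical; the split into data, mark and encoding follows Vega-Lite's vocabulary, which is one implementation of these ideas rather than the only way to name the elements.

```python
import json

# A chart described as parts: data, a mark (geometry), and encodings that map
# data fields to visual channels. All values and field names are hypothetical.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "title": "Larger engines tend to use more fuel",
    "data": {
        "values": [
            {"engine_l": 1.6, "mpg": 35},
            {"engine_l": 2.0, "mpg": 30},
            {"engine_l": 3.5, "mpg": 22},
            {"engine_l": 5.0, "mpg": 16},
        ]
    },
    "mark": "point",
    "encoding": {
        # Axis titles are written for a reader, with units, instead of raw field names.
        "x": {"field": "engine_l", "type": "quantitative",
              "title": "Engine displacement (liters)"},
        "y": {"field": "mpg", "type": "quantitative",
              "title": "Fuel economy (miles per gallon)"},
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Changing the chart means changing one part at a time: swap `"mark"` to change the geometry, or edit an entry under `"encoding"` to remap a field, while the data stays untouched.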

Acknowledgements

Various instructors around the world who have inspired, and have made their materials available in an open-source manner
- Jeff Heer
- Tamara Munzner
- Arvind Satyanarayan
- many others
Our co-instructors at Georgetown University
Many many individuals with whom we’ve discussed data visualization and best practices
