Lecture 1

Course intro, expectations, why we visualize, gallery of bad visualizations, designing for an audience, visualization best practices

Abhijit Dasgupta, Jeff Jacobs, Anderson Monken, and Marck Vaisman

Georgetown University

Spring 2024

Agenda and Goals for Today

Lecture

  • Course introduction and expectations
  • Why we visualize
  • Thinking of our audience
  • Start thinking of best practices
  • What NOT to do (one of many)

Lab

  • Data wrangling for visualization
  • ggplot, matplotlib and seaborn review

Your instructors

Marck Vaisman

  • AI & ML Cloud Architect and Data Scientist at Microsoft
  • Teaching at Georgetown since 2016 & GWU since 2015
  • Co-Founder of DataCommunityDC
  • R Fanatic

Fun Facts

  • I love music and try to play music at the beginning of class, typically EDM. Other genres I love are Latin, Bluegrass and Chill
  • I speak fluent Spanish; I grew up in Venezuela
  • Love beer & bourbon
  • Goofball
  • Westie owner
  • I can speak like Donald Duck

Anderson Monken

  • Data Science Manager at Federal Reserve Board of Governors
    • Team uses big data, web development, software development, machine learning, and AI
    • Technology and cloud initiatives
    • Research focus in international trade and economics
  • Adjunct Professor since 2022
  • DSAN Program Graduate

Fun Facts

  • Amateur car mechanic on several old BMWs
  • Can solve a Rubik’s Cube in under a minute
  • Canoed over 200 miles in Canada

Abhijit Dasgupta

  • Data Science Associate Director at AstraZeneca supporting Oncology R&D
    • bioinformatics, biomarkers, clinical studies
    • autoencoders, survival analysis, signal processing
  • Adjunct Professor at Georgetown since 2020
  • R and reproducible research evangelist
    • Python is cool too!!
  • Co-founder of Statistical Programming DC (with Marck Vaisman)

Fun Facts

  • I’m a 4th-degree black belt in Aikido,
    • over 30 years’ experience providing flyer miles
  • Exploring global whiskey, currently on Japan
  • Active in community theater, mainly backstage but sometimes on stage.

Jeff Jacobs

  • Assistant Teaching Professor since 2023
  • Finished up PhD in Political Economy from Columbia University in NYC in 2022
  • Research papers: NLP for empirical studies of labor economics, game theory for normative models of justice/domination/exploitation in labor markets
  • Dissertation: Using NLP to study history of political thought as a series of rhetorical “wars of words”, from 18th century to present

Fun Facts

  • Born and raised ~5 minutes east of Georgetown campus!
  • Passion project: Teaching CS + Design Thinking classes each summer in refugee camps in Gaza, the West Bank, and one time in northern Syria
  • Favorite hobbies: Making moody computer music, reading books that I don’t have to read to procrastinate reading books that I do have to read
  • Lover of all animals but especially my cat Biko

About the Course

Course description

This course explores the art and science of data visualization from the ground up.

We discuss the power of visualizations to communicate data in an engaging, informative, and accessible way. Drawing from statistics, graphic design, and computer and information science techniques, you will learn to think critically about data, choose optimal static and dynamic visualization methods, recognize dishonest visualization techniques, and distinguish between visualizations used for data exploration and publication.

We cover popular visualization libraries in R and Python, front-end web development tools for digital publishing with D3.js and other JavaScript libraries, and commercial applications like Tableau.

By the end of the course, you will have a deep understanding of data visualization techniques and of design principles for generating publication-quality graphics, and you will be able to manipulate data for analysis and visualization and design compelling visualizations for different purposes and audiences.

Learning objectives

  • Increase your data visualization vocabulary
  • Understand what comes before and after creating visualizations
  • Think critically about data
  • Distinguish between using visualizations for data exploration or presentation
  • Manipulate and arrange data for the purposes of analysis and visualization
  • Design effective visualizations for different purposes and audiences
  • Apply a set of rules to create highly effective and engaging data visualizations
  • Understand the role of data visualization within data science and for solving problems
  • Build dynamic visualizations designed for digital and interactive consumption
  • Use an array of tools including R, Python, Tableau, JavaScript and web tools (HTML, CSS, etc.) to execute all of the above. This includes particular packages and libraries like ggplot2, matplotlib/seaborn, plotly, Observable Plot, Vega/Vega-Lite/Altair, and others.

Our roadmap

Data types

  • Quantitative
  • Qualitative or discrete
  • Dates and times
  • Text
  • Maps, geographic, location based
  • Relationships (graphs and networks)
  • Combinations of all

Getting there

  • Static visualizations
  • Exploratory data analysis
  • Interactive visualizations
  • Storytelling and visual narratives
  • Dashboarding

Software

We will mainly use two scripting languages in this course:

  • R
    • In particular, we will make heavy use of the tidyverse group of packages and its philosophy
  • Python

Given we are in 2024:

  • D3.js primarily via higher-level JavaScript libraries like plotly and Observable Plot, as well as R and Python wrappers

And for those who want point-and-click (but how do you make it reproducible?):

  • Tableau, a popular data visualization software.

Expectations

We have high expectations and strong opinions!

  • This is a professional, graduate-level program.
  • You must use excellent design practices and make choices that support them for a clean and professional presentation.
  • Your spelling and grammar must be accurate.
  • All work should look professional and have the requisite titles, labels, source references, annotations and other textual elements in plain English
    • No raw variable names or dataset names are allowed. Titles and legends must show context.
    • Visualizations should also follow a visual theme you have created, not one of the default styles found in ggplot2, ggthemes, hrbrthemes, matplotlib, seaborn or any other package we have covered in the course.
    • Visualizations should also be understandable on their own. Brief legends and annotations are allowed.

You will develop your own visual theme. No default settings are permitted, period. Default settings are not usually the best. Make your visualization both beautiful and useful, and your own.

YOUR WORK MUST BE PUBLICATION READY!

It ain’t that easy

Data visualization is neither simple nor easy. It requires thought, creativity and understanding both the data and the topic/aspect/answer you are trying to express through the visualization. Give yourself ample time to think, explore and experiment. It is very easy for us to determine when you didn’t give yourself enough time and “phoned it in”.

  • Starting out is easy, especially with the current software tools.
  • However, getting to presentation/publication quality takes a lot of effort and/or tweaking.

Let’s avoid this

We’re gonna have an awesome time!

Brief history of data visualization

Where did it start?

  • The original “data visualizers” were map makers, astronomers and navigators

200 AD: Ptolemy


1569 AD: Gerardus Mercator


The scientific revolution

1543: Nicolaus Copernicus


1637: René Descartes


The industrial revolution

1854: John Snow maps cholera clusters in London

  • Geospatial mapping of homes with cholera cases
  • One of the first cases of spatial epidemiology
  • Visually identified clusters of incident cases around a particular water source

Industrial revolution

1858: Florence Nightingale: Causes of mortality in the Army in the East

Goal: convince generals that most deaths were preventable and not caused directly by war wounds.

War time

1869: Minard: Flow of Napoleon’s army during his war with Russia

The information age

  • 1970s: CAD/CAM
  • 1977: John Tukey, a statistics professor at Princeton University, published “Exploratory Data Analysis”, which established exploratory data analysis (EDA) with visualizations as a practice.
  • 1980s: Scientific and business visualizations (Harvard Graphics)
  • 1983: Edward Tufte published “The Visual Display of Quantitative Information”, which showed effective visualization methods.
  • 1984: Apple Computer introduced the first popular and affordable computer that focused on graphics (GUI) as a mode of interaction and display. This was huge and persists today. (Modified from version at IBM)
  • 1990s: Excel, PowerPoint, R
  • 1999: The term “information visualization” was first so named in the book “Readings in Information Visualization: Using Vision to Think” by Card, Mackinlay, and Shneiderman
  • Data visualization is now so ubiquitous that it’s hard to imagine a time without it

Why is data visualization important?

Humans are visual learners

  • Most people are inherently visual

  • We have a highly developed visual cortex

  • Visualizations take advantage of our innate capability to understand visual patterns quickly and intuitively

  • Visualizations also take advantage of our ability to detect weird patterns or aberrations

When do we use visualization?

Record information

  • We can compress a huge amount of information into a condensed visual representation (e.g., blueprints, photographs, seismographs, maps)

Analyze and explore

Exploratory visualization

  • Develop and assess hypotheses
  • Identify a good model for the data
  • Find patterns, spot trends, and discover errors

Communicate

Explanatory visualization

  • Tell stories, share, inform, persuade
  • Collaborate and revise

Why do we want to visualize data?

  • To be able to show patterns and relationships vividly

  • To provide insight and knowledge

    • Not just throw together some representation of the data
    • We’re translating data into a visual medium to communicate some concept or idea to our audience
  • Take advantage of our natural visual strengths and pattern recognition ability

Typical summaries

  • Averages
  • Variances
  • Correlations
  • Regression lines
  • These are examples of data compression, where we’re “squeezing” the data into a few numbers
    • Unfortunately these are typically lossy compression methods
    • You can’t recover the data from these, because several data sets can give you the same summaries

Summaries don’t differentiate

  • Anscombe (1973) created these toy examples
  • Averages of x and y are the same
  • Correlations between x and y are the same
  • Relationships are VERY different
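
The claim is easy to check numerically. The sketch below, in Python (one of the two course languages), hard-codes Anscombe's four data sets and computes the mean of x, the mean of y, and the Pearson correlation for each, using only the standard library. The helper name `pearson` and the dictionary layout are our own choices for illustration.

```python
from statistics import mean

# Anscombe's (1973) quartet: sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# (mean of x, mean of y, correlation), each rounded to two decimals
summaries = {
    name: (round(mean(x), 2), round(mean(y), 2), round(pearson(x, y), 2))
    for name, (x, y) in anscombe.items()
}
for name, summary in summaries.items():
    print(name, summary)
```

All four sets print the same rounded triple, so the summaries alone cannot tell them apart; only a plot of the points can.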

A modern take: the datasaurus

Regardless of pattern, the points have the same marginal means and variances and the same Pearson correlation!!

Our choice of visualization determines what is revealed

Three varying 1D distributions of data, all with the same boxplot representation.

Note that as the data changes, both the histogram and the strip plot reflect the changes, but the boxplot doesn’t.
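
A small numeric sketch of the same effect: the two made-up samples below (our own illustrative numbers, not the data behind the figure) have an identical five-number summary, which is all a basic boxplot draws, yet their binned counts differ. It assumes quartiles are computed with the "inclusive" method of Python's `statistics.quantiles`; other quartile conventions may give slightly different values.

```python
from collections import Counter
from statistics import quantiles

def five_number_summary(data):
    """Min, Q1, median, Q3, max: the numbers a basic boxplot draws."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))

def bin_counts(data, width=2.5):
    """Crude histogram: number of points falling in each bin of the given width."""
    return dict(sorted(Counter(int(v // width) for v in data).items()))

# Two made-up samples: one evenly spread, one clumped around 2.5 and 7.5
even = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
clumped = [0, 2, 2.5, 2.5, 2.5, 5, 7.5, 7.5, 7.5, 8, 10]

print(five_number_summary(even))
print(five_number_summary(clumped))
print(bin_counts(even))
print(bin_counts(clumped))
```

The two summaries match, so the boxplots would be drawn identically, while the bin counts (a stand-in for the histogram) show where the clumped sample piles up.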

Our choice of visualization determines what is revealed

Seven distributions of data, shown as raw data points (or strip plots), as box plots, and as violin plots

Last 3 figs: Matejka & Fitzmaurice (2017)

You see something similar here. The violin plot can show distributional changes while the boxplot can’t.

Is this really a problem for using boxplots?

Not really. Most of the time we’re trying to show differences in location (mean, median) rather than differences in distribution. The boxplot does perfectly well at showing those differences.

Simpson’s paradox

Code
# Load packages quietly; MASS provides mvrnorm()
suppressPackageStartupMessages(library(tidyverse, quietly = TRUE, warn.conflicts = FALSE))
suppressPackageStartupMessages(library(MASS, quietly = TRUE, warn.conflicts = FALSE))
set.seed(5)

# Three groups with the same negative within-group covariance,
# but with means that increase along the diagonal
Sigma <- matrix(c(2, -0.7, -0.7, 2), ncol = 2)
d <- list()
d[['Group 1']] <- as.data.frame(mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma))
d[['Group 2']] <- as.data.frame(mvrnorm(n = 1000, mu = c(3, 3), Sigma = Sigma))
d[['Group 3']] <- as.data.frame(mvrnorm(n = 1000, mu = c(6, 6), Sigma = Sigma))
D <- bind_rows(d, .id = 'Group')  # columns: Group, V1, V2

theme_set(theme_bw())
# Pooled data: a single fitted line with a positive slope
ggplot(D, aes(V1, V2)) + geom_point() + geom_smooth(method = 'lm', color = 'red')
# By group: each fitted line has a negative slope
ggplot(D, aes(V1, V2)) + geom_point(aes(color = Group), show.legend = FALSE) +
  geom_smooth(aes(color = Group), se = FALSE, method = 'lm')

A trend or result that is present when the data are put into groups but reverses or disappears when the data are combined.
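
The R code above shows the reversal graphically. As a rough numeric companion, the Python sketch below simulates data with the same structure (three groups centered at 0, 3 and 6, each with a within-group slope of about -0.35) and compares the fitted slope within each group with the slope for the pooled data. It uses only the standard library, so the random draws will not match the R output; `ols_slope` is a helper written for this illustration.

```python
import random
from statistics import mean

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(5)  # fixed seed so the run is repeatable
groups = {}
for name, center in [("Group 1", 0), ("Group 2", 3), ("Group 3", 6)]:
    # Within a group: var(x) = 2 and cov(x, y) = -0.7, so the slope is -0.35;
    # the noise variance 1.755 makes var(y) equal 2 as well
    u = [rng.gauss(0, 2 ** 0.5) for _ in range(1000)]
    x = [center + ui for ui in u]
    y = [center - 0.35 * ui + rng.gauss(0, 1.755 ** 0.5) for ui in u]
    groups[name] = (x, y)

within = {name: ols_slope(x, y) for name, (x, y) in groups.items()}
pooled = ols_slope(
    [v for x, _ in groups.values() for v in x],
    [v for _, y in groups.values() for v in y],
)
print({name: round(s, 2) for name, s in within.items()})
print("pooled:", round(pooled, 2))
```

Every within-group slope comes out negative while the pooled slope is positive, because the differences between the group centers dominate the pooled fit.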

We visualize to understand our models and their performance

We visualize to tell a story

Note: We will re-create this visualization later in the class

We visualize to help with communicating our work to our stakeholders!

Data alone is not insight

  • Humans can tell a better story than data can by itself
  • Numbers rarely speak for themselves and need context

Data scientists with storytelling skills have greater business impact:

  • Influencing for impact comes down to conveying a compelling narrative around data and what it means
  • Data scientists who develop this skill typically have an edge in getting their work noticed and acted upon

In your work as data scientists, in addition to doing modeling and machine learning work, you will be responsible, either individually or as part of a team, for providing the following as part of a project:

  • Findings: what does the data say?
  • Conclusions: what is your interpretation of the data?
  • Recommendations: what can be done as a result of the findings and conclusions?

What aspects of this does data visualization support?

Case study

  • You are a newly hired data scientist with visualization superpowers, expert coding skills in R, Python, Excel (Visual Basic), SAS, SPSS, Tableau, PowerBI
  • You are given a dataset and you begin to explore it
  • You kind of know what you’re looking for, but you don’t know what you’re going to find yet
  • You work with your bag of tools through the available resources

You create a SUPER AWESOME visualization

And the reaction you get is…

Which makes you feel like

So how can we do better?

The things not to do

  • Make your audience perform mental gymnastics
  • Use colors that are indistinguishable for a segment of your audience
  • Do a poor job of designing your visualization
    • A poor choice of the visualization type
    • No titles, labels, legends or annotations
    • Using cryptic variable names instead of English/French/Swedish/Chinese
    • Chartjunk!!
    • Increasingly, not making your visualization mobile-friendly
  • … and the list can go on, and on, and on…

“The audience’s reaction should never be ‘what am I looking at?’ If the visualization requires more than a sentence to explain, it isn’t good.” (Ryan Rosario)

How do we want to visualize?

  • Begin with the audience in mind (to bastardize Stephen Covey)

  • Design with intention

    • Know your story and tell it

There is a mental model for how to create a data visualization. It’s called the Grammar of Graphics.

If you know ggplot2, you already know what this is.

Otherwise, stay tuned.

Many software packages incorporate the Grammar of Graphics. Of course, ggplot2 is the granddaddy of them, but seaborn.objects, plotly, Altair, Vega and others also implement it to varying degrees.

The things to get right

  • Visual encoding of data, or how different pieces of data are represented on the canvas
  • Visual integrity and honesty
  • Knowing your audience
  • Making good design choices to make visualizations clearer and more impactful
  • … and for us, reproducibility

Who cares, right?

You MUST care about your different audiences

  • Readers who land on your visualization may not have the same luxury of developing and answering questions like you did
  • Your audience wants to know the story, conclusions, and/or results; they don’t want to analyze the data - that’s your job!

Visualization for Analysis

  • visualizations for you and your team
  • team and audience know context
  • tool for understanding datasets
  • iterate quickly to develop insights
  • rough drafts
  • can make changes later

Visualization for Presentation

  • audience external to you and team
  • content is likely new and audience has no context
  • designed to communicate useful information
  • takes significantly more time
  • publication ready

Visualization is an iterative process

Practice, practice, practice

So how do we make a great visualization?

Data <—> Grammar

  • Data visualization is as valuable to anyone working with data as grammar is to someone working with words

  • Just as you should not write an essay without proper grammar, you should not create a graph without first mastering data visualization best practices

Some influential figures in data visualization

Nathan Yau

Well-known influencer in the data visualization community, R guru, and creator of FlowingData.

Ed Tufte

Statistician, professor and pioneer in the field of information design and data visualization.

Noah Iliinsky

Visualization and information designer, UX architect.

Data visualization is part art, part science

Jules Morgan (https://ime.springerhealthcare.com/art-vs-science-in-a-global-pandemic/)


There are several sets of principles for good visualization design

Nathan Yau

Adjustment rules

  • Explain the encodings
  • Provide context
  • Focus on readability
  • Develop aesthetics

7 basic rules for making charts and graphs

  1. Check the data
  2. Explain encodings
  3. Label axes
  4. Include units
  5. Keep your geometry in check
  6. Include your sources
  7. Consider your audience

Ed Tufte

Integrity principles

  • Show data variation, not design variation
  • Do not use graphics to quote data out of context
  • Use clear, detailed, thorough labeling
  • Representation of numbers should be directly proportional to numerical quantities
  • Don’t use more dimensions than the data require
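
Tufte attaches a number to the proportionality principle, which he calls the Lie Factor: the size of the effect shown in the graphic divided by the size of the effect in the data. The sketch below applies that ratio to a made-up bar chart whose value axis starts at 45 instead of 0. The data values, the baseline and the function names are assumptions chosen for illustration.

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    return abs(second - first) / first

def lie_factor(shown_first, shown_second, data_first, data_second):
    """Effect shown in the graphic divided by the effect in the data."""
    return effect_size(shown_first, shown_second) / effect_size(data_first, data_second)

# Hypothetical data values and a hypothetical truncated axis
data_a, data_b = 50, 55
baseline = 45

# Bar heights as drawn when the axis starts at the baseline
truncated = lie_factor(data_a - baseline, data_b - baseline, data_a, data_b)
# Bar heights as drawn when the axis starts at zero
honest = lie_factor(data_a, data_b, data_a, data_b)

print("truncated axis:", round(truncated, 2))
print("zero baseline:", round(honest, 2))
```

A value near 1 means the graphic is proportional to the data. Here the truncated axis turns a 10% difference in the data into bars that differ by 100%.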

Design principles

  • Show comparisons
  • Show causality
  • Use multivariate data
  • Completely integrate modes (like text, images, numbers)
  • Establish credibility
  • Focus on content

Noah Iliinsky

Four pillars of visualization

A successful visualization:

  1. Has a clear purpose (why this visualization)
  2. Includes (only) the relevant content (what to visualize)
  3. Uses appropriate structure (how to visualize it)
  4. Has useful formatting (everything else)

Nathan’s seven rules details

1. Check the data

  • This is obvious: if your data is weak, your chart is weak
  • Start with simple graphs to see if there are any outliers.

2. Explain encodings

  • Don’t assume the reader knows what everything means; graphs should have captions!
  • Provide a legend
  • Label the marker shapes
  • Explain color scale

Nathan’s seven rules details (cont’d)

3. Label axes

  • Axes without labels or explanation are just decoration
  • Describe the scale (incremental, exponential, logarithmic?)
  • If possible, have axis values start at zero (don’t omit the baseline)

4. Include units

  • Numbers without units are meaningless
  • Remove the guesswork

Nathan’s seven rules details (cont’d)

5. Keep your geometry in check

  • This is something that is immediately noticeable
  • Be careful when using area to compare quantities that are not themselves areas. If a shape’s radius or side length is scaled by the value, its area grows with the square of the value.
  • Tip: size circles and other 2D shapes by area, unless it’s a bar chart.
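
The tip about sizing by area comes down to one line of arithmetic: to make a circle's area proportional to a value, its radius must be proportional to the square root of that value. A minimal sketch, with hypothetical values and function names of our own:

```python
import math

def radius_for_value(value, max_value, max_radius):
    """Radius that makes the circle's AREA proportional to value."""
    return max_radius * math.sqrt(value / max_value)

def area(radius):
    """Area of a circle with the given radius."""
    return math.pi * radius ** 2

# Hypothetical values: the second is four times the first
small, large = 25, 100
r_small = radius_for_value(small, max_value=large, max_radius=20)
r_large = radius_for_value(large, max_value=large, max_radius=20)

# Sized by area: the area ratio matches the value ratio
area_ratio = area(r_large) / area(r_small)
# The mistake: using the value itself as the radius inflates the area ratio
naive_ratio = area(large) / area(small)

print("radii:", r_small, r_large)
print("area ratio when sized by area:", round(area_ratio, 2))
print("area ratio when radius = value:", round(naive_ratio, 2))
```

With area-based sizing, the larger circle looks four times as big, matching the data. With radius-based sizing it looks sixteen times as big.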

6. Include your sources

  • This is another obvious one: always include the source of your data!
  • This makes your graphic more reputable and allows others to dig deeper

Nathan’s seven rules details (cont’d)

7. Consider your audience

  • What purpose do your charts have and who are they for?
  • Avoid quirky fonts
  • Make good design choices

A mental model: The Grammar of Graphics

Historical context

  • 1994: William S. Cleveland, The Elements of Graphing Data, lists the “basic elements of graph construction” as scales, captions, plotting symbols, reference lines, keys, labels, panels, and tick marks.

  • 1999: L. Wilkinson’s book The Grammar of Graphics defines ‘components of a graphic’:

    • Structured framework utilizing a layered approach to describe and construct visualizations
  • 2009: ggplot2: Elegant Graphics for Data Analysis, book by Hadley Wickham.

What is it?

  • The Grammar of Graphics is a conceptual framework for thinking about graphics
  • It provides a hierarchy of elements to deconstruct and understand figure design
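
To make "a hierarchy of elements" concrete, here is a toy sketch that describes a plot as data plus aesthetic mappings plus a stack of layers, borrowing ggplot2's vocabulary. The classes `Layer` and `PlotSpec` are hypothetical teaching devices, not the API of ggplot2 or of any Python library, and they draw nothing. The example spec mirrors the second ggplot2 call in the Simpson's paradox code: points and a smoother, both colored by group.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Layer:
    geom: str                                    # the mark to draw, e.g. "point"
    mapping: dict = field(default_factory=dict)  # aesthetics specific to this layer

@dataclass(frozen=True)
class PlotSpec:
    data: str                  # name of the data set
    mapping: dict              # which variable drives which aesthetic (x, y, ...)
    layers: tuple = ()         # marks drawn on top of one another, in order
    coord: str = "cartesian"   # coordinate system

    def __add__(self, layer):
        """Return a new spec with one more layer, imitating ggplot2's `+`."""
        return PlotSpec(self.data, self.mapping, self.layers + (layer,), self.coord)

    def describe(self):
        """One readable line per element of the plot."""
        lines = [f"data: {self.data}", f"aesthetics: {self.mapping}", f"coord: {self.coord}"]
        for i, layer in enumerate(self.layers, start=1):
            lines.append(f"layer {i}: {layer.geom} {layer.mapping}")
        return lines

base = PlotSpec(data="D", mapping={"x": "V1", "y": "V2"})
spec = base + Layer("point", {"color": "Group"}) + Layer("smooth", {"color": "Group"})
print("\n".join(spec.describe()))
```

Reading a figure this way (what is the data, what is mapped to what, which marks are layered) is the deconstruction the framework is meant to support.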

Acknowledgements

  • Various instructors around the world who have inspired us and have made their materials available in an open-source manner
  • Our co-instructors at Georgetown University
  • Many, many individuals with whom we’ve discussed data visualization and best practices

Lab