Course intro, expectations, why we visualize, gallery of bad visualizations, designing for an audience, visualization best-practices
Georgetown University
Spring 2024
ggplot, matplotlib and seaborn review
Fun Facts
This course explores the art and science of data visualization from the ground up.
We discuss the power of visualizations to communicate data in an engaging, informative, and accessible way. Drawing from statistics, graphic design, and computer and information science techniques, you will learn to think critically about data, choose optimal static and dynamic visualization methods, recognize dishonest visualization techniques, and distinguish between visualizations used for data exploration and publication.
We cover popular visualization libraries in R and Python, front-end web development tools for digital publishing with D3.js and other JavaScript libraries, and commercial applications like Tableau.
By the end of the course, you will have a deep understanding of data visualization techniques, design principles for generating publication-quality graphics, manipulating data for analysis and visualization, and designing compelling visualizations for different purposes and audiences.
ggplot2, matplotlib/seaborn, plotly, Observable Plot, Vega/Vega-Lite/Altair, and others.
We will mainly use two scripting languages in this course:
pandas, matplotlib, seaborn.
Given we are in 2024:
plotly and Observable Plot, as well as R and Python wrappers.
And for those who want point-and-click (but how do we make it reproducible?):
ggplot2, ggthemes, hrbrthemes, matplotlib, seaborn or any other package we have covered in the course.
You will develop your own visual theme. No default settings are permitted, period. Default settings are not usually the best. Make your visualization both beautiful and useful, and your own.
YOUR WORK MUST BE PUBLICATION READY!
Data visualization is neither simple nor easy. It requires thought, creativity, and an understanding of both the data and the topic/aspect/answer you are trying to express through the visualization. Give yourself ample time to think, explore, and experiment. It is very easy for us to tell when you didn’t give yourself enough time and “phoned it in.”
These will also be pinned on the main Slack channel.
1854: John Snow cholera clusters in London
1856: Florence Nightingale: Causes of mortality in the Army in the East
1869: Minard: Flow of Napoleon’s army during his war with Russia
Most people are inherently visual
We have a highly developed visual cortex
Visualizations take advantage of our innate capability to understand visual patterns quickly and intuitively
Visualizations also take advantage of our ability to detect weird patterns or aberrations
Exploratory visualization
Explanatory visualization
To be able to show patterns and relationships vividly
To provide insight and knowledge
Take advantage of our natural visual strengths and pattern recognition ability
Regardless of pattern, the points have the same marginal means and variances and the same Pearson correlation!!
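A quick numerical check of this idea, as a minimal sketch. It does not use the dataset behind the figures on these slides; it swaps in Anscombe's quartet (1973), a classic set of four small datasets that makes the same point. The helper names `pearson` and `summarize` are made up for this illustration, and only the Python standard library is used.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(x, y):
    """Pearson correlation, computed directly from its definition."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def summarize(x, y):
    """Marginal means, sample variances, and correlation for one dataset."""
    return {
        "mean_x": mean(x),
        "mean_y": mean(y),
        "var_x": variance(x),
        "var_y": variance(y),
        "r": pearson(x, y),
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    # Round only for display; the four rows come out nearly identical.
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The four printed rows agree closely, yet a scatter plot of each dataset looks completely different (a line, a curve, a line with one outlier, and a vertical stack with one leverage point). The summary table alone cannot reveal that; the plot can.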
Note that as the data change, both the histogram and the strip plot reflect the changes, but the boxplot doesn’t
Last 3 figs: Matejka & Fitzmaurice (2017)
You see something similar here. The violin plot can show distributional changes while the boxplot can’t
Is this really a problem for using boxplots?
Not really. Most of the time we’re trying to show differences in location (mean, median) rather than differences in distribution. The boxplot does perfectly well at showing those differences
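To see why a boxplot can hide a change in shape, here is a minimal sketch with two hypothetical toy samples (the values are invented for illustration). A boxplot draws only the five-number summary, so two samples that share it produce the same box even if their shapes differ. The function name `five_number_summary` is an assumption of this sketch, not something from the course materials.

```python
from collections import Counter
from statistics import quantiles


def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the numbers a basic boxplot draws."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical toy samples with the same range and quartiles.
spread_out = [0, 1, 2, 3, 4, 5, 6, 7, 8]  # one observation at every value
clumped = [0, 2, 2, 2, 4, 6, 6, 6, 8]     # observations pile up at 2 and 6

print(five_number_summary(spread_out))
print(five_number_summary(clumped))

# Value counts (a crude histogram) do tell the two samples apart.
print(sorted(Counter(spread_out).items()))
print(sorted(Counter(clumped).items()))
```

Both samples give the same five numbers, so their boxplots would be identical, while the counts show one flat sample and one with two clumps. Neither sample has points beyond 1.5 times the interquartile range, so the whiskers would reach the minimum and maximum in both cases.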
# Load packages quietly; MASS provides mvrnorm()
suppressPackageStartupMessages(library(tidyverse, quietly = TRUE, warn.conflicts = FALSE))
suppressPackageStartupMessages(library(MASS, quietly = TRUE, warn.conflicts = FALSE))
set.seed(5)

# Three groups share one covariance matrix (negative within-group correlation);
# only the group means differ, shifted along the diagonal
Sigma <- matrix(c(2, -0.7, -0.7, 2), ncol = 2)
d <- list()
d[['Group 1']] <- as.data.frame(mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma))
d[['Group 2']] <- as.data.frame(mvrnorm(n = 1000, mu = c(3, 3), Sigma = Sigma))
d[['Group 3']] <- as.data.frame(mvrnorm(n = 1000, mu = c(6, 6), Sigma = Sigma))
D <- bind_rows(d, .id = 'Group')
theme_set(theme_bw())

# Pooled data: a single fitted line slopes upward
ggplot(D, aes(V1, V2)) + geom_point() + geom_smooth(method = 'lm', color = 'red')

# Colored by group: each group's fitted line slopes downward
ggplot(D, aes(V1, V2)) + geom_point(aes(color = Group), show.legend = FALSE) +
  geom_smooth(aes(color = Group), se = FALSE, method = 'lm')
Simpson’s paradox: a trend or result that is present when the data are put into groups but reverses or disappears when the data are combined
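The reversal described here (Simpson's paradox) can also be checked with plain arithmetic. The sketch below is a Python illustration on hypothetical toy data, not a port of the simulation: three tiny groups centered at (0, 0), (3, 3), and (6, 6), each lying exactly on a line of slope -1. The helper `ols_slope` is a name made up for this example.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical toy data: within each group, y falls as x rises,
# but the group centers climb along the diagonal.
groups = {
    f"Group {i + 1}": ([c - 1, c, c + 1], [c + 1, c, c - 1])
    for i, c in enumerate((0, 3, 6))
}

# Slope fitted separately within each group.
within = {name: ols_slope(x, y) for name, (x, y) in groups.items()}

# Slope fitted once to all points pooled together.
all_x = [v for x, _ in groups.values() for v in x]
all_y = [v for _, y in groups.values() for v in y]
pooled = ols_slope(all_x, all_y)

print("within-group slopes:", within)
print("pooled slope:", pooled)
```

Every within-group slope is negative while the pooled slope is positive, because the differences between group centers dominate the pooled fit. Coloring points by group in a plot is what makes this visible at a glance.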
Note: We will re-create this visualization later in the class
In your work as data scientists, in addition to doing modeling and machine learning work, you will be responsible, either individually or as part of a team, for providing the following as part of a project:
What aspects of this does data visualization support?
“The audience’s reaction should never be ‘what am I looking at?’ If the visualization requires more than a sentence to explain, it isn’t good.” (Ryan Rosario)
Begin with the audience in mind (to bastardize Stephen Covey)
Design with intention
There is a mental model for how to create a data visualization. It’s called the Grammar of Graphics.
If you know ggplot2, you already know what this is. Otherwise, stay tuned.
Many software packages incorporate the Grammar of Graphics. Of course, ggplot2 is the grand-daddy of them, but seaborn.objects, plotly, altair, vega and others also implement it.
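As a rough illustration of what "grammar" means in practice, the sketch below writes a scatter plot as a Vega-Lite specification built from a plain Python dictionary. It is only a sketch under stated assumptions: the rows are hypothetical toy values, the column names `V1`, `V2`, and `Group` are chosen for illustration, and nothing is rendered here. The point is that the chart is described by separate parts (data, a mark, and encodings that map columns to visual channels) rather than by naming a chart type.

```python
import json

# Hypothetical toy observations; values and column names are illustrative only.
rows = [
    {"V1": 0.5, "V2": 1.2, "Group": "Group 1"},
    {"V1": 1.1, "V2": 0.4, "Group": "Group 1"},
    {"V1": 3.2, "V2": 3.9, "Group": "Group 2"},
    {"V1": 4.0, "V2": 2.7, "Group": "Group 2"},
]

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    # Data: the table being plotted.
    "data": {"values": rows},
    # Mark: the geometric object drawn for each row.
    "mark": "point",
    # Encoding: which column drives which visual channel.
    "encoding": {
        "x": {"field": "V1", "type": "quantitative"},
        "y": {"field": "V2", "type": "quantitative"},
        "color": {"field": "Group", "type": "nominal"},
    },
}

# Serialize to JSON, the form a Vega-Lite renderer would consume.
print(json.dumps(spec, indent=2))
```

Changing `"mark"` from `"point"` to `"line"` or `"bar"` changes the geometry while the data and encodings stay put, which is the same separation that ggplot2 expresses with `aes()` and `geom_*()` layers.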
Nathan Yau: well-known influencer in the data visualization community, R guru, and creator of FlowingData.
Edward Tufte: statistician, professor, and pioneer in the field of information design and data visualization.
Adjustment rules
7 basic rules for making charts and graphs
Integrity principles
Design principles
Four pillars of visualization
A successful visualization:
1994: William S. Cleveland, The Elements of Graphing Data, lists the “basic elements of graph construction” as scales, captions, plotting symbols, reference lines, keys, labels, panels, and tick marks.
1999: L. Wilkinson’s book The Grammar of Graphics defines ‘components of a graphic’:
2009: ggplot2: Elegant Graphics for Data Analysis, Book by Hadley Wickham.
DSAN 5200 | Spring 2024 | https://gu-dsan.github.io/5200-spring-2024/