Abhijit Dasgupta, Jeff Jacobs, Anderson Monken, and Marck Vaisman
Georgetown University
Spring 2024
What we’ve learned
Principles of visual design
Visual encodings and marks
Keeping the audience in mind
Better design choices
Using visualization for exploratory data analysis
Data wrangling for visualization and more
Visual narratives and the role of interactive graphics
Creating interactive visualizations using altair and plotly (which leverage JavaScript)
Exploring d3.js and its ability to provide granular control over interactive graphics
Efficient plotting for Big Data using rasterization, bin-summarize methods, and incorporation into interactive visualizations
Working with geospatial data
Working with network or graph data
Today
Modeling and visualizing results
Unsupervised learning
Correlograms
Hierarchical clustering
Heatmaps
Dimension reduction and transformations (PCA, t-SNE, UMAP)
Supervised learning
Regression models
Decision trees
Model introspection
Model predictive performance
Survival/reliability models
Visualizing the results of models
We’ve talked a lot about exploratory and descriptive visualizations to understand the unprocessed data
Modeling is part of most insight strategies to understand patterns in the data. These are often the “end results” of a data science project and the “punch line” of what you’re presenting
Visualizing results, rather than presenting them in tables, can be more impactful
Takes advantage of the audience’s innate pattern-recognition capabilities
Ability to highlight main stories out of the models
Visualizing the results of models
Easier to not only show results but also highlight weaknesses
Fitting a straight line to a curvilinear relationship
dend %>%
  set("labels_col", value = c("skyblue", "orange", "grey"), k = 3) %>%
  set("branches_k_color", value = c("skyblue", "orange", "grey"), k = 3) %>%
  plot(axes = FALSE)
# rgb(red, green, blue, alpha)
rect.dendrogram(dend, k = 3, lty = 5, lwd = 0, which = 3,
                col = rgb(0.1, 0.2, 0.4, 0.1))

Code
dend %>%
  set("labels_col", value = c("skyblue", "orange", "grey"), k = 3) %>%
  set("branches_k_color", value = c("skyblue", "orange", "grey"), k = 3) %>%
  plot(horiz = TRUE, axes = FALSE)
abline(v = 350, lty = 2)
Hierarchical clustering (Python)
Code
import pandas as pd
from matplotlib import pyplot as plt
from scipy.cluster import hierarchy
import numpy as np
import seaborn as sns

# Data set
iris = sns.load_dataset('iris')
df = iris.iloc[:, :4]

# Calculate the distance between each sample
Z = hierarchy.linkage(df, 'average')
ind = hierarchy.cut_tree(Z, 3)

# Plot with custom leaves
hierarchy.set_link_color_palette(['m', 'b', 'g'])
hierarchy.dendrogram(Z, leaf_rotation=90, leaf_font_size=8,
                     color_threshold=1.8, above_threshold_color='grey');
plt.show()
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. (Wikipedia)
from sklearn.cluster import KMeans
from yellowbrick.cluster import silhouette_visualizer
from yellowbrick.datasets import load_credit

# Load a clustering dataset
X, y = load_credit()

# Specify rows to cluster: under 40 y/o and have either graduate or university education
X = X[(X['age'] <= 40) & (X['edu'].isin([1, 2]))]

# Use the quick method and immediately show the figure
silhouette_visualizer(KMeans(5, random_state=42), X, colors='yellowbrick')
More granular code (different data)
from sklearn.cluster import KMeans
from yellowbrick.cluster import SilhouetteVisualizer
from yellowbrick.datasets import load_nfl

# Load a clustering dataset
X, y = load_nfl()

# Specify the features to use for clustering
features = ['Rec', 'Yds', 'TD', 'Fmb', 'Ctch_Rate']
X = X.query('Tgt >= 20')[features]

# Instantiate the clustering model and visualizer
model = KMeans(5, random_state=42)
visualizer = SilhouetteVisualizer(model, colors='yellowbrick')
visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
Evaluating cluster analysis: Elbow plot
from sklearn.cluster import KMeans
from yellowbrick.cluster.elbow import kelbow_visualizer
from yellowbrick.datasets.loaders import load_nfl

X, y = load_nfl()

# Use the quick method and immediately show the figure
kelbow_visualizer(KMeans(random_state=4), X, k=(2, 10))
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from yellowbrick.datasets.loaders import load_nfl

# Load the NFL clustering dataset
X, y = load_nfl()

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2, 10),
                              metric='calinski_harabasz', timings=False)
visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
Heatmaps
Code
data <- as.matrix(mtcars)
library(RColorBrewer)
heatmap(data, scale = 'column', cexRow = 1,
        col = colorRampPalette(brewer.pal(8, 'Blues'))(25))
Heatmaps
Using a complex heatmap to understand the prevalence of measles in the US over time and the effect of introducing the vaccine.
Code
source('ml/complexhm.R')
Heatmaps (Python)
Code
# Libraries
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt

# Data set
url = 'https://raw.githubusercontent.com/holtzy/The-Python-Graph-Gallery/master/static/data/mtcars.csv'
df = pd.read_csv(url)
df = df.set_index('model')

# Prepare a vector of color mapped to the 'cyl' column
my_palette = dict(zip(df.cyl.unique(), ["orange", "yellow", "brown"]))
row_colors = df.cyl.map(my_palette)

# Plot
sns.clustermap(df, metric="correlation", method="single", cmap="Blues",
               standard_scale=1, row_colors=row_colors);
PCA
PCA (principal components analysis) is a method to “rotate” the data based on the variability of the data cloud, resulting in fewer variables that capture most of the information. It’s a popular method of dimension reduction.
PCA first finds the direction in which there is maximum variance
It then finds the direction orthogonal (uncorrelated) to the first direction with the most remaining variance
And so on
In this case, the first two principal components capture 99.7% of the variability in the iris measurements (see the sketch after these bullets)
It’s not uncommon that PCA separates groups along one of the axes
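To illustrate how such a figure is computed, here is a minimal scikit-learn sketch that fits a two-component PCA to the iris measurements and reports the proportion of variance each component captures (the exact percentages depend on whether the measurements are standardized first, so this may not reproduce the 99.7% figure exactly).

Code
from sklearn.decomposition import PCA
import seaborn as sns

# The four iris measurements
iris = sns.load_dataset('iris')
X = iris.iloc[:, :4]

# Rotate the data onto its first two principal components
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Proportion of total variance captured by each component
print(pca.explained_variance_ratio_)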
UMAP (uniform manifold approximation and projection) is another non-linear dimension-reduction technique that is increasingly used in bioinformatics to cluster high-dimensional data. It is believed to preserve both local and global structure and give a “fair” representation of the data.
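A minimal sketch of UMAP on the same iris measurements, assuming the umap-learn package is installed (this is an illustration, not the figure from the slide).

Code
import umap
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')
X = iris.iloc[:, :4]

# Non-linear embedding of the four measurements into two dimensions
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X)

# Color points by species to see whether the embedding separates the groups
plt.scatter(embedding[:, 0], embedding[:, 1],
            c=iris['species'].astype('category').cat.codes)
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.show()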
from yellowbrick.datasets import load_credit
from yellowbrick.features import PCA

# Specify the features of interest and the target
X, y = load_credit()
classes = ['account in default', 'current with bills']

visualizer = PCA(scale=True, classes=classes)
visualizer.fit_transform(X, y)
visualizer.show()
t-SNE
Code
from yellowbrick.datasets import load_credit
from yellowbrick.features import Manifold

# Specify the features of interest and the target
X, y = load_credit()
classes = ['account in default', 'current with bills']

visualizer = Manifold(manifold='tsne', classes=classes)
visualizer.fit_transform(X, y)
visualizer.show()
Voronoi diagrams
A Voronoi diagram splits up a plane based on a set of original points. Each polygon, or Voronoi cell, contains an original point and all areas that are closer to that point than any other.
This can be used in different contexts, but especially with geospatial data
Here, we’re plotting the locations of airports in the USA
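A minimal scipy sketch of the construction, using random points as stand-ins for the airport locations (which are not included here).

Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

# Random 2-D points standing in for airport coordinates
rng = np.random.default_rng(42)
points = rng.uniform(size=(30, 2))

# Each Voronoi cell is the region closer to its point than to any other point
vor = Voronoi(points)
voronoi_plot_2d(vor, show_vertices=False)
plt.show()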
One of the things we need to understand (beyond patterns we may see in PCA and other dimension reduction methods) is whether we have sets of correlated features that are either explanatory or serendipitous. One way of visualizing this is using a two-dimensional ranking of features.
Code
from yellowbrick.datasets import load_credit
from yellowbrick.features import Rank2D

# Load the credit dataset
X, y = load_credit()

# Instantiate the visualizer with the Pearson ranking algorithm
visualizer = Rank2D()
visualizer.fit(X, y)           # Fit the data to the visualizer
visualizer.transform(X)        # Transform the data
visualizer.show()              # Finalize and render the figure
Something quite similar is seen in genetics in a linkage disequilibrium or LD plot.
Regression models
We will use the diamonds dataset for the supervised learning section. This data is available in R by library(ggplot2); data(diamonds) and in Python by import seaborn as sns; diamonds = sns.load_dataset("diamonds").
Our initial regression model will look at price as a function of carat, cut, color, clarity, and depth; a rough Python equivalent is sketched below
You can also look at marginal effects for one variable conditional on one or more other variables
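A rough Python sketch of that model using statsmodels; the diagnostic plots that follow were produced from a model fit in R, so this is only an illustrative equivalent.

Code
import seaborn as sns
import statsmodels.formula.api as smf

diamonds = sns.load_dataset("diamonds")

# price as a function of carat, cut, color, clarity, and depth
mod = smf.ols('price ~ carat + C(cut) + C(color) + C(clarity) + depth',
              data=diamonds).fit()
print(mod.summary())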
Model diagnostics
Residual plots
ggplot(mod_aug, aes(x = .fitted, y = .std.resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2, color = 'blue') +
  geom_smooth(color = 'red') +
  labs(x = 'Predicted values', y = 'Standardized residuals')
We expect, in well-fitting models, that the residuals are on average 0, and that they have no pattern with the predicted/fitted values
Q-Q plots for residuals
An assumption of linear (OLS) regression is that the errors follow a Gaussian (normal) distribution.
This can be visually checked using a quantile-quantile (Q-Q) plot.
If the residuals follow a Gaussian distribution, this plot should be close to the diagonal line
Code
ggplot(mod_aug, aes(sample = .std.resid)) +
  stat_qq() +
  geom_abline(linetype = 2) +
  labs(title = 'Residual Q-Q plot', y = 'Standardized residuals')
Residual plots (yellowbrick)
We can see both the pattern of residuals and their distribution (as a histogram or, as in the code below, a Q-Q plot)
Code
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import ResidualsPlot

# Load a regression dataset
X, y = load_concrete()

# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the linear model and visualizer
model = LinearRegression()
visualizer = ResidualsPlot(model, hist=False, qqplot=True)
visualizer.fit(X_train, y_train)    # Fit the training data to the visualizer
visualizer.score(X_test, y_test)    # Evaluate the model on the test data
visualizer.show()                   # Finalize and render the figure
Cook’s distance
Cook’s distance or Cook’s D is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis
Cook’s distance measures the effect of deleting a given observation. Points with a large Cook’s distance are considered to merit closer examination in the analysis.
ggplot(mod_aug, aes(x = seq_along(.cooksd), y = .cooksd)) +
  geom_col(fill = 'red') +
  labs(x = 'Obs number', y = "Cook's distance")
Code
from yellowbrick.regressor import CooksDistance
from yellowbrick.datasets import load_concrete

# Load the regression dataset
X, y = load_concrete()

# Instantiate and fit the visualizer
visualizer = CooksDistance()
visualizer.fit(X, y)
visualizer.show()
Leverage
Leverage is a measure of how far away the independent variable values of an observation are from those of the other observations.
High-leverage points are observations made at extreme or outlying values of the independent variables; the lack of neighboring observations means that the fitted regression model will pass close to those particular observations.
Code
ggplot(mod_aug, aes(.hat, .std.resid)) +
  geom_point(aes(size = .cooksd)) +
  geom_smooth(se = F) +
  labs(x = 'Leverage', y = 'Residuals', size = "Cook's D")
Machine learning models
Describing ML models
In this section, we will consider a classification machine learning model to predict breast cancer from various pathological variables, which is available as BreastCancer in the mlbench package in R
We will fit a Random Forest model to this data
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay, PrecisionRecallDisplay, RocCurveDisplay
from sklearn.inspection import PartialDependenceDisplay
from sklearn.calibration import calibration_curve
from yellowbrick.model_selection import FeatureImportances

brca = pd.read_csv('ML/data/brca.csv').dropna()
y = brca.pop('Class').astype('category').cat.codes.values
X = brca.values

rfc = RandomForestClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)
rfc.fit(X_train, y_train);
Feature importances are typically a measure of how much a predictive performance metric is reduced when the values of a variable are scrambled (permuted). They represent the predictive power of a variable
Feature importances and p-values aren’t necessarily related; a variable can be predictive and not statistically significant and vice versa. The former depends on the size of the effect, and the latter depends on the signal-to-noise ratio
Code
fi = pd.DataFrame({'features': brca.columns, 'importance': rfc.feature_importances_})
sns.barplot(data=fi, x='importance', y='features', color='blue')
plt.show()
Feature importances
Code
viz = FeatureImportances(RandomForestClassifier())
viz.fit(brca, y)
viz.show()
Regularization (lasso/ridge/elastic net)
Lasso regression fits a penalized regression with \(L_1\) penalties on the coefficients. This model minimizes
\[
||y-X\beta||^2_2 + \lambda ||\beta||_1
\]
where \(||x||_1 = \sum_i |x_i|\) and \(||x||_2 = \sqrt{\sum_i x_i^2}\)
This forces slope coefficients to 0 as the value of \(\lambda\) changes.
The plot shows the values of the slope coefficients for different values of \(\log(\lambda)\)
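A minimal sketch of such a coefficient-path plot with scikit-learn’s lasso_path, using the numeric diamonds columns as an illustrative example (the slide’s plot may have been produced with a different tool and variables).

Code
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

diamonds = sns.load_dataset("diamonds")
X = StandardScaler().fit_transform(diamonds[['carat', 'depth', 'table', 'x', 'y', 'z']])
y = np.log(diamonds['price'])

# Coefficient values over a grid of penalty strengths (lambda, called alpha in scikit-learn)
alphas, coefs, _ = lasso_path(X, y)

# One line per coefficient, shrinking toward 0 as the penalty grows
for coef in coefs:
    plt.plot(np.log(alphas), coef)
plt.xlabel('log(lambda)')
plt.ylabel('Coefficient')
plt.show()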
The elastic net regression (ELR) minimizes the penalized objective function
\[
||y-X\beta||^2_2 + \alpha\lambda ||\beta||_1 + \frac{(1-\alpha)\lambda}{2} ||\beta||_2^2
\]
Notice that in ELR, the coefficients move to 0 at a slower rate than for lasso regression
\(\alpha=1\): Lasso regression
\(\alpha=0\): Ridge regression
Regularized regression is typically used for variable selection in regression models, where \(\lambda\) is estimated via cross-validation (sketched below)
Note that the estimated coefficients are biased by construction, so they are typically not interpreted directly.
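A minimal sketch of that cross-validation step with scikit-learn’s ElasticNetCV, reusing the illustrative diamonds features from the lasso sketch; note that scikit-learn’s l1_ratio plays the role of \(\alpha\) here and its alpha plays the role of \(\lambda\).

Code
import numpy as np
import seaborn as sns
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

diamonds = sns.load_dataset("diamonds")
X = StandardScaler().fit_transform(diamonds[['carat', 'depth', 'table', 'x', 'y', 'z']])
y = np.log(diamonds['price'])

# 5-fold cross-validation over lambda, with the mixing parameter alpha fixed at 0.5
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
print(enet.alpha_)   # selected penalty strength (lambda)
print(enet.coef_)    # (biased) coefficients at the selected penalty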
A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as probability of false alarm and can be calculated as (1 − specificity).
A classifier no better than chance will have a ROC curve along the unit diagonal
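A minimal sketch using the RocCurveDisplay helper imported in the earlier chunk, applied to the fitted random forest (rfc) and the held-out test data defined there.

Code
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# ROC curve for the fitted random forest on the held-out test data
RocCurveDisplay.from_estimator(rfc, X_test, y_test)
plt.plot([0, 1], [0, 1], '--')   # a classifier no better than chance falls along this diagonal
plt.show()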
The precision-recall curve is used for evaluating the performance of binary classification algorithms. It is often used in situations where classes are heavily imbalanced.
A “good” classifier will have high precision and high recall throughout the graph, to be close to the upper right corner
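Similarly, a minimal sketch with the PrecisionRecallDisplay helper imported earlier, again using the fitted random forest and held-out test data.

Code
import matplotlib.pyplot as plt
from sklearn.metrics import PrecisionRecallDisplay

# Precision-recall curve for the fitted random forest on the held-out test data
PrecisionRecallDisplay.from_estimator(rfc, X_test, y_test)
plt.show()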
We can predict the probability that an individual is a member of a class, rather than the predicted class. One property of the learning methods we can then assess is calibration.
Calibration curves are used to evaluate how well calibrated a classifier is, i.e., how closely the predicted class probabilities match the observed proportions. We bin the data, and for each bin we compute the average predicted probability (x-axis) and the proportion of positives (y-axis).
We expect bins with low proportion of positives to also have low average probability of being in the positive class, and vice versa
Code
from sklearn.calibration import calibration_curve

# calibration_curve returns (fraction of positives, mean predicted probability) per bin
prob_true, prob_pred = calibration_curve(y_test, rfc.predict_proba(X_test)[:, 1])
plt.plot(prob_pred, prob_true, 's-')
plt.plot([0, 1], [0, 1], '--')
plt.xlabel('Mean predicted value')
plt.ylabel('Fraction of positives')
plt.show()
Prediction vs actual
This plot works well for continuous outcomes
We can see how well our predictions match the actual data
Code
ggplot(mod_aug, aes(x = `log(price)`, y = .fitted)) +
  geom_point(alpha = 0.2) +
  geom_abline(color = 'blue', linetype = 2) +
  geom_smooth(color = 'red') +
  labs(y = 'Predicted log(price)')
Survival/reliability models
Lifetime plots
Survival analysis methods can be applied to a variety of domains:
Difference in death rate between two treatments
Failure rate of machines
Customer churn
Survival data is characterized by censoring, i.e., when an individual is lost to follow-up while still “alive”. Censoring is assumed to be independent of “failure” (violations of this assumption require different methods)
This plot looks at how long we followed each individual and whether they failed (red) or were censored (white) the last time we saw them
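A minimal matplotlib sketch of such a lifetime plot, using a small made-up set of follow-up times and event indicators (the slide’s actual data are not reproduced here).

Code
import matplotlib.pyplot as plt

# Made-up follow-up times (days) and event indicators (1 = failed, 0 = censored)
times  = [120, 340, 560, 790, 150, 900, 430, 610]
events = [1,   0,   1,   0,   1,   0,   0,   1]

# One horizontal line per individual, ending in red (failed) or white (censored)
for i, (t, e) in enumerate(zip(times, events)):
    plt.hlines(i, 0, t, color='grey')
    plt.plot(t, i, 'o', color='red' if e else 'white', markeredgecolor='black')
plt.xlabel('Days of follow-up')
plt.ylabel('Individual')
plt.show()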
The Kaplan-Meier estimator is used to estimate the survival function \(\Pr(T > t)\), where \(T\) is the time to failure.
The visual representation of this function is usually called the Kaplan-Meier curve; it shows the probability of surviving beyond a given point in time.
From the plot, at 1000 days, about 75% of patients are expected to survive.
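The code below uses R’s survfit and survminer; for a Python counterpart, here is a rough sketch assuming the lifelines package and a pbc-like data frame with time, status, and trt columns (the file path is hypothetical).

Code
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Assumes a pbc-like data frame with columns: time, status (2 = death), trt
pbc = pd.read_csv('pbc.csv')          # hypothetical file path
kmf = KaplanMeierFitter()

# One Kaplan-Meier curve per treatment group
ax = None
for trt, grp in pbc.groupby('trt'):
    kmf.fit(grp['time'], event_observed=(grp['status'] == 2), label=f'trt={trt}')
    ax = kmf.plot_survival_function(ax=ax)
plt.xlabel('Days')
plt.ylabel('Survival probability')
plt.show()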
Code
fit <- survfit(Surv(time, status == 2) ~ trt, data = pbc)
survminer::ggsurvplot(fit, data = pbc, pval = TRUE, risk.table = TRUE)