Experimental design and inference

Lab 2

Note

Github classroom link: https://classroom.github.com/a/2KPYXw8h

In this lab we will look at different experimental designs, and how they work with inference and decisions. We’ll also start looking at how experimental design can help with dealing with confounders, a topic we will continue into next week.

We will start with the same population that we provided in the first homework

pop <- readRDS('lab-02/pop.rds')
dim(pop)

[1] 1000000       5

Let’s first create a cohort study from this population

# set.seed(2048)
set.seed(20585)
cohort <- pop |>sample_n(50000)

Let’s get an estimate of the odds ratio and it’s confidence interval from this cohort.

library(rsample)
OR <- cohort  |> tabyl(gene, sick) |> 
  select(-1) |>
  odds_ratio()
b <- bootstraps(cohort, times = 1000)  |>
  mutate(OR = map_dbl(splits, \(sp) analysis(sp) |> 
    tabyl(gene, sick)  |>
    select(-1)  |>
    odds_ratio()))
quantile(b$OR, c(0.025, 0.975))

    2.5%    97.5% 
1.899254 2.557906

We can estimate the odds ratio between gene and sick using logistic regression

model1 <- glm(sick ~ gene, data = cohort, family = binomial)
tidy(model1, exponentiate = TRUE, conf.int=TRUE)

# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   0.0585    0.0199    -142.  0          0.0562    0.0608
2 gene          2.20      0.0774      10.2 1.70e-24   1.89      2.56

What happens if we adjust by age?

model2 <- update(model1, sick ~ gene + age_grp)
tidy(model2, exponentiate = TRUE, conf.int = TRUE)

# A tibble: 8 × 7
  term           estimate std.error statistic  p.value conf.low conf.high
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)      0.0275    0.0731    -49.2  0          0.0237    0.0316
2 gene             2.23      0.0782     10.2  1.35e-24   1.91      2.59  
3 age_grp(25,35]   1.36      0.0912      3.34 8.28e- 4   1.14      1.62  
4 age_grp(35,45]   1.57      0.0898      5.05 4.31e- 7   1.32      1.88  
5 age_grp(45,55]   2.21      0.0869      9.14 6.37e-20   1.87      2.63  
6 age_grp(55,65]   2.68      0.0846     11.6  2.30e-31   2.27      3.17  
7 age_grp(65,75]   3.70      0.0840     15.6  1.01e-54   3.15      4.37  
8 age_grp(75,85]   4.74      0.0992     15.7  1.87e-55   3.90      5.76

What happens if we adjust by gender?

model3 <- update(model1, sick ~ gene + gender)
tidy(model3, exponentiate = TRUE, conf.int = TRUE)

# A tibble: 3 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   0.0703    0.0261   -102.   0          0.0668    0.0740
2 gene          1.92      0.0784      8.34 7.42e-17   1.65      2.24  
3 genderM       0.673     0.0399     -9.94 2.73e-23   0.622     0.727

Let’s do a stratified analysis, by gender.

cohort |>
  nest(data = -gender) |>
  mutate(OR = map_dbl(data, \(d) d  |> tabyl(gene, sick) |>select(-1) |> odds_ratio()))

# A tibble: 2 × 3
  gender data                     OR
  <chr>  <list>                <dbl>
1 M      <tibble [24,767 × 4]>  1.77
2 F      <tibble [25,233 × 4]>  1.94

Create a matched case-control study, matching on gender

set.seed(29058)
cases  <- pop |>filter(sick == 1) |>sample_n(250)
cases |>tabyl(gender)

 gender   n percent
      F 142   0.568
      M 108   0.432

controls1 <- pop  |> filter(sick == 0) |>sample_n(250)
controls2_male <- pop |>filter(sick==0, gender == 'M') |>sample_n(sum(cases$gender=="M"))
controls2_female <- pop |>filter(sick == 0, gender == "F") |> 
  sample_n(sum(cases$gender == "F"))
controls2 <- rbind(controls2_male, controls2_female)
cc_unmatched  <-  rbind(cases, controls1)
cc_matched <- rbind(cases, controls2)

Now compute the odds ratio for both the unmatched and matched case control studies, stratified by gender.

Construct a cohort the same size as the case-control study, and compare the efficiency of the unadjusted gene-sick odds ratio

Experimentally find the rough size of the cohort study that is needed for this population to match the efficiency of the results from the case-control study.