Lecture 1

Course intro, expectations, introduction to biological and biomedical data science

Abhijit Dasgupta

Georgetown University

Fall 2024

Agenda and Goals for Today

Lecture

  • Course introduction and expectations
  • Introduction to biological and biomedical data
  • Refresher on statistical foundations and fundamentals

Lab

  • Simulations and resampling in R

Your instructional team

Abhijit Dasgupta

  • Data Science Associate Director at AstraZeneca supporting Oncology R&D
    • bioinformatics, biomarkers, clinical studies
    • survival analysis, Bayesian, autoencoders, signal processing
  • Adjunct Professor at Georgetown since 2020
  • R and reproducible research evangelist
    • Python is cool too!!
  • Co-founder of Statistical Programming DC (with Marck Vaisman)

Fun Facts

  • I’m a 4th-degree black belt in Aikido,
    • with over 30 years of experience providing flyer miles
  • Exploring global whiskey, currently on Japan
  • Active in community theater, mainly backstage but sometimes on stage.

Teaching assistants

Viviana Luccioli

  • Double majored in Biostatistics and Global Public Health at the University of Virginia
  • First year in DSAN
  • Internship experiences: data manager & strategist for a local nonprofit combatting generational poverty, WHO research assistant on controlled medicine policy
  • Professional interests: advancing epidemiology methods (both biological and social) with data science
  • Personal interests: traveling, language exchange, yoga, music (I teach piano lessons as a fun little side hustle), and learning how to cook new dishes

Teaching assistants

Prerana Mandalika

  • Educational background: Majored in Computer Science at Chaitanya Bharathi Institute of Technology.
  • I am currently in my second year in the DSAN program.
  • Internship experience: Interned at KPMG as a software developer, where I developed a Java application for a contract management system.
  • Work experience: Worked at Micron for two years as a software developer, contributing to the development of SQL stored procedures, RPA-based automation tasks, and the refactoring and debugging of the SAP payroll integration code in SAP ABAP.
  • Personal interests: Love singing and dancing!

About the Course

Course description

  • First of all, this is not a course in biology, epidemiology or clinical studies, but a course in data science, statistical methods and data analysis. We will use biology and epidemiologic/clinical studies as a source of data and examples, to contextualize different data science methods.
  • This course will emphasize decision-making using statistical inference, hypothesis testing, and modeling, but will also cover additional topics such as survival analysis, Bayesian inference, machine learning, and high-dimensional data analysis.
  • An important distinction in these domains from usual machine learning applications is the emphasis on causality and explanation rather than prediction.
    • We’re more interested in understanding what affects the outcome and how, rather than just predicting the outcome. This is also driven by regulatory agencies, who require an understanding of why a drug works, and why the pathway the drug is targeting actually affects the disease.
  • A lot of the things we’ll learn in this course will be applicable to other domains.
    • Survival analysis (also called reliability) is central to many engineering and manufacturing applications
    • High-dimensional data analysis is central to many fields, including marketing, finance, and engineering
    • Bayesian inference is central to many decision-making processes

Learning objectives

  • Promoting analytic and critical thinking skills, developing the ability to make logical conclusions from data that are generalizable and robust
  • Understanding different levels of evidence, and how experimental design can help
  • Learning how to validate models and model assumptions
  • Understanding how resampling can help make robust inference
  • Seeing the role Bayesian inference can play in making decisions
  • Learning how to plan and design studies to answer particular questions
  • Understanding the difference between association and causality
  • Handling high-dimensional data
  • Understanding the pros and cons of statistical hypothesis testing
  • Understanding how to handle missing data statistically
  • Learning how to assess explainability of AI models
  • Learning how survival analysis/reliability methods can help us understand time-to-event data, while accounting for particular missing data structures
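As a small illustration of the resampling objective above, here is a minimal sketch of a percentile bootstrap confidence interval for a mean. It is written in Python with only the standard library; the course labs will primarily use R, so treat this as an illustration of the idea, not the course's implementation. The function name `bootstrap_ci`, its parameters, and the data are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=6150):
    """Percentile bootstrap confidence interval for a statistic (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample the data with replacement n_boot times,
    # recomputing the statistic on each resample
    boot_stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    # Use empirical quantiles of the bootstrap distribution as interval endpoints
    lower = boot_stats[int((alpha / 2) * n_boot)]
    upper = boot_stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical, made-up measurements (not course data)
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.6, 4.9]
lower, upper = bootstrap_ci(sample)
print(f"sample mean = {statistics.mean(sample):.2f}")
print(f"95% bootstrap CI = ({lower:.2f}, {upper:.2f})")
```

The interval comes from the spread of the resampled statistics, not from a distributional formula. That is why the same recipe works for statistics such as a median or a trimmed mean by passing a different `stat`.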

Our roadmap

  • Review of basic statistical concepts
  • Experimental design, and the levels of evidence
  • Confounding and causality
  • High-dimensional data analysis
  • Survival analyses
  • Biomarkers
  • Bioinformatics, genomics and systems biology
  • Clinical trials
  • Statistical inference (testing and estimation)
  • Simulation, resampling and in silico experiments
  • Bayesian methods
  • Unsupervised learning (clustering, t-SNE, UMAP, autoencoders)
  • Supervised learning (regression, ML, deep learning)
  • Explainability and interpretability

Class schedule

This class schedule is tentative and may change based on the pace of the class

Week  Date        Topic                                           Notes
1     2024-08-28  Introduction to biological and biomedical data
2     2024-09-04  Experimental design and confounding
3     2024-09-11  Causal inference
4     2024-09-18  Observational studies, outcome sampling
5     2024-09-25  Survival analysis I
6     2024-10-02  Survival analysis II
7     2024-10-09  High-dimensional data
8     2024-10-16  Biomarker discovery
9     2024-10-23  Applied Bayesian methods
10    2024-10-30  Planning studies and clinical trials
11    2024-11-06  Applied machine learning
12    2024-11-13  Explainability
13    2024-11-20  Decision-making
-     2024-11-27  Thanksgiving break                              No class
14    2024-12-04  Project presentations

Class format and deliverables

  • We will try to run this class as a flipped classroom.
    • You will be expected to read the material (course notes, readings, tutorials) before class, and come prepared to discuss and work on problems in class.
    • There will be a short quiz at the beginning of each class to ensure that you have read the material.
    • There will be some didactic elements presented during class, but the main focus will be on working through problems and exercises. These will not be traditional teaching lectures.
  • Each class will have a laboratory portion, which will have a deliverable.
    • These deliverables will be submitted to the course GitHub organization
    • The deliverables will be graded for completion and feedback will be provided
  • Assignments based on the week’s material will be assigned each week and will be graded for correctness, completion and quality
    • These will be submitted to the course GitHub organization
    • You may drop up to 2 assignments without penalty

Due dates

  • Quizzes will be available 1 hour before class and will be due at the end of class.
    • There will be no make-up quizzes or late submissions allowed for quizzes.
    • You can drop up to 3 quizzes without penalty
  • Labs and assignments will be due by midnight on the Thursday following the class.
    • Late assignment submissions will be penalized by 10% per day late, for a maximum of 2 days (before the end of the next class). After that, you can still submit any assignment before the end of the term for a maximum score of 60%.
    • You can drop up to 2 assignments without penalty

Grade weights

  • Assignments : 25%
  • Lab completions and quizzes: 10%
  • Participation: 30%
  • Group project : 35%

See the syllabus for the late policy and other evaluation information
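The weights above combine as a simple weighted average. The sketch below shows the arithmetic only: the component scores are hypothetical, the function name `weighted_grade` is made up for illustration, and dropped assignments, dropped quizzes, and late penalties are not modeled. The syllabus remains the authority on how grades are actually computed.

```python
# Weights as listed on the slide (percent of the final grade)
WEIGHTS = {
    "assignments": 25,
    "labs_and_quizzes": 10,
    "participation": 30,
    "group_project": 35,
}


def weighted_grade(scores):
    """Combine component scores (each on a 0-100 scale) into a final percentage."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the weighted components")
    total_weight = sum(WEIGHTS.values())  # 100 for the weights above
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Hypothetical component scores, for illustration only
example = {
    "assignments": 90,
    "labs_and_quizzes": 100,
    "participation": 80,
    "group_project": 85,
}
print(weighted_grade(example))  # prints 86.25
```

Participation and the group project together carry 65% of the grade, so they move the final result far more than labs and quizzes do.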

Term project

  • Data analysis projects
    • Use publicly available data to answer a biological or biomedical question.
    • Your analytic methods must be concordant with the study design used to collect the data.
    • Use at least two datasets, one to develop your analytic workflow, and the rest to validate your results.
    • All analytic code (in R or Python) must be submitted as part of the project, in the form of a research compendium. The analytic code must be in the form of a reproducible workflow.
  • Review projects
    • A literature review around methods to answer particular analytic questions applicable to the life sciences.
    • This will involve developing a well-annotated bibliography, data analysis or in silico experiments
    • Clear discussion on the strengths and limitations of the methods
    • R or Python code to implement the method, either as a well-annotated set of scripts forming a workflow, or a package

You will form groups of 3-4 by Week 3, and submit a project proposal in Week 4. More details next week.

Course notes

We will have extensive course notes for this class that will be the main reading material for this course. These notes will be linked from the course website (https://gu-dsan.github.io/6150-fall-2024) and will be updated regularly.

  • These notes are a distillation of material from various sources, including textbooks, research papers, and online resources, as well as my own experience teaching, collaborating, and using the methods in practice.
  • The course notes will contain references, links to supplemental readings, questions for reflection, and optional exercises. These are for your benefit and not graded work for the course unless otherwise specified.
  • The notes are a work in progress and will benefit from your input for future improvements.
    • The course notes are open-access on GitHub, and you are welcome to point out errors or typos as issues, or, better yet, fork the repository, correct the mistake, and submit a pull request.
    • You are also welcome to suggest tutorials, examples, or exercises that you think would be beneficial to the class.

Participation

There is significant grade weight on participation in this class.

  • We want you to be engaged in the class, and to participate in discussions and exercises. We want to promote considered thought and critical analytic thinking.

  • We understand that not everyone is comfortable speaking up in class, so we will have other ways for you to participate

    • You will contribute to a weekly shared document with (a) questions for discussion from the topic of the week, or (b) questions around the readings and homework for the week, or (c) answers to questions posed by other students. This is a required element of the class.
    • You will have the opportunity to create shared notes in a Google Doc that will be a living document for the semester. This is optional and you will not be penalized for not participating, but we’ve found that this is a really good practice to have.
    • You can ask questions on the Slack channel
    • You can participate in discussions on Slack or on Canvas
    • You can update/improve/correct the class notes on GitHub, using forks and pull requests. You’ll get credit if your pull request is accepted.

This grade will be holistic.

Important

I will be taking attendance in each class, which will be a part of your participation grade

Software

We will mainly use R version 4.3 or above in this class.

  • In particular, we will make heavy use of the tidyverse group of packages and its philosophy
  • We will also make use of the tidymodels group of packages for modeling, the survival package, and various other packages
  • We will also introduce and use packages from Bioconductor, a community effort to create analytic and data packages in R for biological data and bioinformatic analysis

The course material will primarily use R, but most of the functionality is also available in the Python ecosystem, and work may be done in Python as long as the results are comparable to the R results.

We will also make extensive use of Quarto for reproducible documents and presentations. Within the Quarto ecosystem, we will start using WebAssembly technologies, particularly webR for interactive components that can run without a server.

Expectations

We have high expectations and strong opinions!

  • This is a professional, graduate-level program, and you are second-year students
  • You must use excellent practice in coding and analysis and make choices that support the study design and the data.
  • Your spelling and grammar must be accurate.
  • All submissions should look professional, using proper English and good, professional formatting.

All submissions should look professional and distinctive. Think of them as examples of your own brand; they can be part of a portfolio that showcases your work.

YOUR WORK MUST BE PUBLICATION READY!

It ain’t that easy

We’re aiming to become professionals here

  • Though we’ll be doing analyses that may look like stuff you’ve done in your first year, there is a need for thought and rigor and critical thinking at this level that was not required in your first year.
  • Choices you make in your analysis will have to be justified. Realize that analysts at your level can be part of the decision about whether a drug is approved or not. So every link must ring true.
  • Mistakes can have consequences in the real world. So check your work carefully, or, even better, build in checks that will catch mistakes in your code or analysis.

Let’s avoid this

We’re gonna have an awesome time!