Technical Introduction to Data Science and Health Disparities

This learning research community is a 10-week series of technical lectures and workshops that introduces health disparity researchers to core data science concepts and techniques. Lectures will introduce the fundamental principles and techniques used in data science to extract useful information and knowledge from data. In parallel with the lectures, there will be an R workshop where participants will learn how to explore data, define cohorts, and build participant-level datasets using the All of Us Researcher Workbench. Participants will also learn how to write reproducible, modular R code that follows programming best practices.


Course Updates

  • Week 3 Recap:
    • Classification is the process of training a model on labeled examples to identify which of a set of categories a new observation belongs to.
    • Logistic regression is a statistical model of the probability of an event, in which the log-odds of the event are a linear combination of one or more independent variables.
    • Logistic regression can be used for classification and for quantifying health disparities through odds ratios, which estimate the strength of the association between two events, for example cancer mortality in group A vs. cancer mortality in group B.
  • Week 2 Recap:
    • We can define health disparities as differences in health outcomes, whether better or worse, between defined racial and ethnic minority populations (or any other subpopulation) and the (specified) reference population.
    • According to NIH guidance, self-identification is the preferred means of obtaining racial and ethnic identity.
    • Data empathy is understanding the ‘story’ of a dataset.
  • Week 1 Recap:
    • Data science is a process that starts with our understanding of health disparities, which must influence our understanding of data, data preparation for modeling, and evaluation of our models.
    • Data science and machine learning allow us to capture higher levels of complexity: descriptive (i.e., what happened), diagnostic (i.e., why did it happen), predictive (i.e., what will happen), and prescriptive (i.e., how can we make it happen).
    • Machine learning models can be broadly classified into two categories. Supervised methods use training data with labeled examples, for tasks such as classification. Unsupervised methods do not require labeled training examples; they are used for tasks such as clustering and require interpretation of the patterns identified in the data.
  • Our first lecture will be Tuesday 9/6 at 6:30 PM ET.
  • We are creating an online community on a platform called Discord, where we will host lectures as well as discussions during and outside of lectures. We are also hoping to use the platform for regular communications. You can join the community using this link. The signup process is pretty straightforward, but please feel free to reach out if you have any issues.
  • Due to overwhelming interest in the series and a dearth of qualified instructors, we are unable to host everyone for the R workshop series. Instead, we are putting together a smaller cohort (by randomly sampling registered participants who have completed the course pre-survey) to test-drive the curriculum and train 10 instructors who will teach cohorts in the Spring. We will provide more details during our first lecture next Tuesday.
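
To make the Week 3 recap concrete, the sketch below computes an odds ratio for cancer mortality in group A vs. group B and shows how it relates to the log-odds form of logistic regression. It is only an illustration, not course material: the counts are invented, the function names (`odds`, `odds_ratio`, `logistic`) are hypothetical helpers, and Python is used here purely for convenience. With a single binary group indicator as the only predictor, the fitted intercept equals the log-odds in the reference group and the fitted slope equals the log of the odds ratio, so no model-fitting library is needed.

```python
import math


def odds(events: int, non_events: int) -> float:
    """Odds of an event: number of events divided by number of non-events."""
    return events / non_events


def odds_ratio(a_events: int, a_non: int, b_events: int, b_non: int) -> float:
    """Odds of the event in group A relative to the odds in reference group B."""
    return odds(a_events, a_non) / odds(b_events, b_non)


def logistic(z: float) -> float:
    """Inverse of the log-odds (logit) transform: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical counts, invented for illustration only.
a_deaths, a_survivors = 30, 70  # group A
b_deaths, b_survivors = 20, 80  # group B (reference population)

or_ab = odds_ratio(a_deaths, a_survivors, b_deaths, b_survivors)

# Logistic regression with one binary predictor x (1 = group A, 0 = group B):
#   log-odds(death) = intercept + slope * x
# For this simple case the maximum-likelihood estimates have a closed form.
intercept = math.log(odds(b_deaths, b_survivors))  # log-odds in reference group B
slope = math.log(or_ab)                            # log of the odds ratio

p_b = logistic(intercept)          # predicted mortality probability, group B
p_a = logistic(intercept + slope)  # predicted mortality probability, group A

print(f"odds ratio (A vs. B): {or_ab:.3f}")
print(f"P(death | A) = {p_a:.2f}, P(death | B) = {p_b:.2f}")
```

An odds ratio above 1 means the odds of the outcome are higher in group A than in the reference group; a value of exactly 1 would indicate no difference in odds between the two groups.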