A general framework to adjust for missing confounders in observational studies

Project Background:

This is a methodological project to help enhance the use of administrative data, such as hospital admissions, cancer registrations and mortality, in health research. Assessing the impact of a risk factor/exposure X on a health outcome Y in observational (epidemiological) studies is invariably subject to confounding. Cohort (individual-level) studies are an ideal source of information as they typically contain a rich set of individual-level variables. Nevertheless, a study based only on a cohort may suffer from selection bias and a lack of population representativeness. Cohort studies may also lack the statistical power to assess rare outcomes and geographical or other group-level variations, which limits the extent to which contextual factors such as area-level social deprivation can be investigated.

Study Aims:

Routinely collected administrative data are a good alternative in terms of representativeness; however, these data sources typically hold a limited number of variables for a large population and may omit important predictors/confounders, leading to potentially biased risk estimates.
We propose a general framework that integrates these two sources of data and builds a propensity-score-like index to summarise the values of the confounders from the cohorts/surveys, so that only one variable needs to be imputed when confounders are missing. The index will then be included in the epidemiological analysis through a flexible model to provide a direct estimate of the link between X and Y.
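
The idea of collapsing several confounders into a single index can be illustrated with a minimal sketch on simulated data. Everything in it is an assumption made for illustration and not the project's actual method: the simulated cohort, the names (fit_logistic, ps_index, true_effect), the logistic exposure model, and the use of a simple linear adjustment in place of the flexible model. The step of imputing the index into administrative records is not attempted here.

```python
import numpy as np


def fit_logistic(design, target, n_iter=25):
    """Fit a logistic regression by Newton-Raphson and return the coefficients."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(design @ beta)))
        weights = prob * (1.0 - prob)
        gradient = design.T @ (target - prob)
        hessian = design.T @ (design * weights[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


rng = np.random.default_rng(0)
n = 20000
true_effect = 1.0  # hypothetical effect of X on Y used to simulate the data

# Hypothetical cohort: three confounders C, binary exposure X, continuous outcome Y.
C = rng.normal(size=(n, 3))
exposure_prob = 1.0 / (1.0 + np.exp(-(C @ np.array([0.8, -0.5, 0.3]))))
X = rng.binomial(1, exposure_prob).astype(float)
Y = true_effect * X + C @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=n)

# Step 1: model exposure given the confounders and keep the linear predictor
# (the logit of the propensity score) as a one-dimensional summary index.
design_c = np.column_stack([np.ones(n), C])
ps_index = design_c @ fit_logistic(design_c, X)

# Step 2: estimate the X-Y association without and with adjustment for the index.
naive_design = np.column_stack([np.ones(n), X])
adjusted_design = np.column_stack([np.ones(n), X, ps_index])
naive_effect = np.linalg.lstsq(naive_design, Y, rcond=None)[0][1]
adjusted_effect = np.linalg.lstsq(adjusted_design, Y, rcond=None)[0][1]

print(f"naive estimate:    {naive_effect:.2f}")
print(f"adjusted estimate: {adjusted_effect:.2f}")
```

In this simulation the unadjusted estimate is confounded, while adjusting for the single index should bring the estimate close to the simulated effect. A linear term is adequate here only because the simulated confounders are Gaussian; real data would likely need a more flexible functional form for the index.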

Health data: 

NHS Digital HES inpatients - CVD/Asthma 1994-2001, ONS cancer registrations - Lung Cancer 1999-2003, Health Survey for England (1994-2001), Millennium Cohort Study. Data from the Health Survey for England for 2002 onwards will also be requested.

Benefits to Public:

This project will identify the need for integration of data sources in order to account for residual confounding, thus reducing or eliminating the bias in estimates of epidemiological risk. Outputs will consist of scientific papers and computer code. The project is estimated to be completed by 2018.