Abstract
In the past decades, technological advances have made it possible to collect, store and analyse Big data in disease biology. The amount and complexity of Big data require us to address the challenges posed by its large volume, veracity and heterogeneity. This includes, for instance, data integration across high-throughput omics data, electronic medical records and environmental characteristics. Interpretation of these heterogeneous data promises to derive value towards understanding health, disease and treatment outcomes at an individual patient level. The promise of precision medicine is to use individual variability in genomics, environment and lifestyle to guide personalized prevention and treatment strategies. Current methods in genomics employ association testing of single genetic variants in large cohorts. Even though challenged by the requirement of very large patient cohorts, these methods have produced several valuable insights over the past decade. Their limitations are that they do not provide a straightforward strategy for correlating multiple features to exploit the complex interactions that exist in biology, and that clinically translatable insights are difficult to derive given the low effect sizes of individually associated variants. Machine learning approaches facilitate analysis of large and complex biological data by detecting non-linear interactions between heterogeneous features, and offer strategies to handle high-dimensional datasets. Furthermore, machine learning offers individual-level predictions which can be translated into personalized recommendations for prevention or treatment of disease. These methods certainly have their own challenges, but are beginning to show promise in smaller patient cohorts. Eventually, the goal is to derive information from rich datasets that may inform clinical practice and take a step towards precision medicine.
This PhD thesis consists of three research projects in which prediction models of different health and disease outcomes were developed using machine learning approaches with potential applications in precision medicine. The first project explored predictive modelling of weight loss in eight-week dietary interventions with a whole grain-rich diet, a low-gluten diet or a refined grain diet in apparently healthy Danish individuals (N = 102) at cardiometabolic risk. The individuals were deeply phenotyped, with several phenotypic and biochemical characteristics as well as genotype, gut microbiome and urine metabolome data. Given the challenge of applying machine learning to a small cohort with several high-dimensional data types, several transformations and reductions of the features, such as polygenic risk scores, were applied to improve predictive power. In addition, feature engineering, such as modelling the variability of the longitudinal post-prandial response, improved predictive performance. Finally, an ensemble model capturing different aspects of biology was built by combining the most predictive individual models. The developed model may function as an early screening tool when determining individual weight loss strategies. The second project established prediction models of the time to insulin in type 2 diabetes patients. Artificial neural networks were used to integrate longitudinal biomarkers of drug prescriptions, biochemistry, anthropometry and blood pressure collected in electronic medical records over up to 20 years of follow-up, as well as information on lifestyle, social deprivation and genotype. Electronic medical records contain irregularly sampled measurements, and individual patient trajectories vary in length. Thus, these records were formatted into a structured representation of the data (using both a single time point approach and a longitudinal approach).
By assessing the individual risk of insulin requirement across approximately 6000 patients, the model may eventually help to reduce clinical inertia in very high-risk patients, or reduce health care interventions and costs by identifying very low-risk patients. The third project focused on prediction of asparaginase-associated pancreatitis (AAP), a serious treatment toxicity in childhood acute lymphoblastic leukemia (ALL) (N = 1390, 205 AAP cases). It is currently difficult to predict which patients are at risk of AAP. Even more difficult is predicting which patients will develop a second AAP following re-exposure to asparaginase after the first AAP event. Machine learning algorithms were used to establish SNP-based prediction models of AAP and second AAP that can capture interactions between genetic variants. One path of least disruptive implementation could be to use only the extremes of the model output, where both low-risk and high-risk patients are identifiable at very high confidence. For instance, identification of patients at high risk of AAP can inform the decision to increase monitoring, while identification of patients at low risk of a second AAP can inform the decision to re-expose them to asparaginase. The models may eventually assist in further stratification of patients and adjustment of treatment protocols in childhood ALL. In summary, this thesis demonstrates the feasibility of machine learning approaches for integrating different types of health and disease data, as well as their utility for subgroup and individual-level predictions of complex disease outcomes. It also explores how value can be derived from combinations of heterogeneous and longitudinal patient data. Eventually, the goal of these studies is to find a path for implementing and translating findings from machine learning models into personalized strategies for prevention or treatment of disease.