Tracing Sources of Zoonotic Salmonella Infections and Contamination using Whole Genome Sequencing Data and Machine Learning
Abstract
Biological and chemical agents can contaminate food and cause foodborne outbreaks and sporadic cases of disease. Salmonella is a foodborne bacterium that caused about 90,000 human salmonellosis cases in Europe, of which about 1,100 were reported in Denmark in 2017. Preventing disease in the animal reservoir, and thereby contamination of food, is therefore desirable to protect consumers and improve public health. Source attribution models link sporadic human cases of a specific illness, such as salmonellosis, to specific food sources and animal reservoirs. A Salmonella source attribution model estimates how many human Salmonella infections each source causes. The source account is based on human salmonellosis cases and positive Salmonella samples from animals and food registered as part of the Danish national surveillance programs for animals, food and humans. The results guide decision makers on the prevention and control of human salmonellosis and thereby help improve public health. A Danish Salmonella source account has been published in the "Annual report on zoonoses in Denmark" every year for the past two decades. Until 2017, the source account was based on a Bayesian modelling approach that accounted for the prevalence of the Salmonella types in the different sources. Over the years, the types have been defined by serotyping, phage typing, MLVA typing and resistance profiling. From January 2017, however, serotyping and MLVA typing of isolates collected as part of the Danish Salmonella surveillance were replaced by whole genome sequencing. New models that can take sequence data into account are therefore needed. The overall aim of this thesis was to develop a new model for source attribution of human salmonellosis cases to potential sources using whole genome sequencing data and machine learning.
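To make the idea of a prevalence-based source account concrete, the sketch below splits the human cases of each Salmonella type across sources in proportion to the prevalence of that type in each source. It is a deliberately simplified, deterministic frequency-matching calculation and not the Bayesian model used for the Danish source account. The function name, the type names and all numbers are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def attribute_cases(human_cases, prevalence):
    """Split human cases of each type across sources in proportion to prevalence.

    human_cases: {type: number of human cases}
    prevalence:  {type: {source: prevalence of that type in the source}}
    Returns ({source: expected cases}, number of cases with no matching source).
    """
    per_source = defaultdict(float)
    unattributed = 0.0
    for subtype, n_cases in human_cases.items():
        weights = prevalence.get(subtype, {})
        total = sum(weights.values())
        if total == 0:
            # The type was never found in any sampled source.
            unattributed += n_cases
            continue
        for source, p in weights.items():
            per_source[source] += n_cases * p / total
    return dict(per_source), unattributed


# Hypothetical surveillance numbers, for illustration only.
human_cases = {"type_A": 40, "type_B": 10, "type_C": 5}
prevalence = {
    "type_A": {"pigs": 0.03, "broilers": 0.01},
    "type_B": {"cattle": 0.02},
}

expected, unknown = attribute_cases(human_cases, prevalence)
for source, n in sorted(expected.items()):
    print(f"{source}: {n:.1f} expected cases")
print(f"unattributed: {unknown:.1f} cases")
```

A Bayesian model of this kind would additionally estimate source-specific and type-specific factors with uncertainty, which this sketch leaves out.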
Eligible datasets of whole genome sequences of Salmonella Typhimurium isolates and associated metadata were collected from four European countries: Denmark, Germany, the United Kingdom and France. The Danish, German and British datasets were used to attribute human salmonellosis cases to animal reservoirs, and the French dataset was used to investigate contamination of the environment by attributing environmental isolates to different animal reservoirs. A supervised logit boost machine learning algorithm trained on the allelic variations in the core genes of the Danish food and animal dataset predicted a source for 81% of the Danish human salmonellosis cases. The results were in line with those obtained with the previously used Bayesian model. The machine learning model was successfully applied to the German and British datasets, demonstrating that it generalizes to sequence data from countries other than Denmark. As eating patterns are steadily shifting towards a more plant-based diet, the Bayesian and the machine learning source attribution models were also used to identify potential sources of Salmonella contamination of Australian macadamia nuts and of the French marine environment, respectively. The findings suggest that the macadamia nuts were contaminated either by direct transmission from animals with access to the plantations (such as kangaroos) or indirectly from animal reservoirs via biosolids-soil-compost. Whether the machine learning approach can be applied to investigate contamination of non-human sources remains uncertain, because the investigation of the French marine environment was unsuccessful. No allelic variations were found, probably because of the low sample sizes of sources other than pigs. Had sufficient strains been included, it is likely that biological variation would have been identified. We therefore assume that the model could be successful if applied to a dataset with larger sample sizes for all included sources.
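The supervised approach described above can be sketched as follows: isolates of known source are represented by their allele numbers at a set of core-gene loci, a boosted classifier is trained on them, and human isolates are then assigned source probabilities. The code is a minimal from-scratch multiclass LogitBoost with one-locus regression stumps, written along the lines of the general algorithm by Friedman, Hastie and Tibshirani. It is an illustration under stated assumptions, not the implementation used in the thesis. All function names, the three-locus profiles and the source labels are hypothetical.

```python
import math


def one_hot_features(profiles):
    """List every (locus, allele) pair seen in the training profiles."""
    return sorted({(locus, allele) for p in profiles for locus, allele in enumerate(p)})


def fit_stump(profiles, features, z, w):
    """Weighted least-squares regression stump on one (locus, allele) indicator."""
    best = None
    for locus, allele in features:
        sums = {True: [0.0, 0.0], False: [0.0, 0.0]}  # [sum of w*z, sum of w]
        for p, zi, wi in zip(profiles, z, w):
            side = sums[p[locus] == allele]
            side[0] += wi * zi
            side[1] += wi
        value = {k: (s[0] / s[1] if s[1] > 0 else 0.0) for k, s in sums.items()}
        error = sum(wi * (zi - value[p[locus] == allele]) ** 2
                    for p, zi, wi in zip(profiles, z, w))
        if best is None or error < best[0]:
            best = (error, locus, allele, value[True], value[False])
    return best[1:]  # (locus, allele, value if match, value otherwise)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stump_update(stumps, profile):
    """Centred contribution of one boosting round to the class scores."""
    n = len(stumps)
    raw = [a if profile[l] == al else b for l, al, a, b in stumps]
    mean = sum(raw) / n
    return [(n - 1) / n * (r - mean) for r in raw]


def fit_logitboost(profiles, labels, rounds=10, z_max=4.0):
    classes = sorted(set(labels))
    features = one_hot_features(profiles)
    scores = [[0.0] * len(classes) for _ in profiles]
    model = []
    for _ in range(rounds):
        probs = [softmax(s) for s in scores]
        stumps = []
        for j, cls in enumerate(classes):
            z, w = [], []
            for label, p in zip(labels, probs):
                pj = min(max(p[j], 1e-6), 1 - 1e-6)
                y = 1.0 if label == cls else 0.0
                # Working response (clipped) and weight of the Newton step.
                z.append(max(-z_max, min(z_max, (y - pj) / (pj * (1 - pj)))))
                w.append(pj * (1 - pj))
            stumps.append(fit_stump(profiles, features, z, w))
        model.append(stumps)
        for s, p in zip(scores, profiles):
            for j, delta in enumerate(stump_update(stumps, p)):
                s[j] += delta
    return classes, model


def predict_proba(classes, model, profile):
    scores = [0.0] * len(classes)
    for stumps in model:
        for j, delta in enumerate(stump_update(stumps, profile)):
            scores[j] += delta
    return dict(zip(classes, softmax(scores)))


# Hypothetical allele profiles (one allele number per core-gene locus).
animal_profiles = [[1, 2, 1], [1, 2, 2], [3, 2, 1], [3, 4, 1], [5, 4, 2], [5, 2, 2]]
animal_sources = ["pigs", "pigs", "cattle", "cattle", "broilers", "broilers"]

classes, model = fit_logitboost(animal_profiles, animal_sources)
human_profiles = [[1, 2, 1], [3, 4, 2]]
predictions = []
for profile in human_profiles:
    probs = predict_proba(classes, model, profile)
    predictions.append(max(probs, key=probs.get))
print(predictions)
```

In this toy data the first locus separates the three sources, so the boosted stumps pick it up in every round. With real data, thousands of core-gene loci and unevenly sized source groups would make both the feature selection and the resulting probabilities far less clear-cut.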
In conclusion, this PhD project collected four European datasets of Salmonella Typhimurium sequences and developed a new approach to source attribution. The machine learning approach is a recommended option when sequence data are available. It was also demonstrated that source attribution analyses originally developed to attribute human salmonellosis to animal reservoirs can be a first step when investigating the origin of microbiological contamination of fresh produce, as exemplified by Australian macadamia nuts. This PhD thesis is also the result of international and interdisciplinary collaboration and communication between laboratory technicians, epidemiologists, microbiologists and bioinformaticians, which is essential when working with genomic epidemiology.