Combining Machine Learning with Bayesian Inverse Modelling to Estimate the Conditional Probability of Developing Oropharyngeal Cancer following an Oral Human Papilloma Virus Infection

*Prerna Tewari, *Eugene Kashdan, Cathal Walsh, Cara M Martin, Andrew C Parnell, John J O’Leary

Dept. Histopathology and Morbid Anatomy, Trinity College Dublin, Dublin, Ireland
Dept. Pathology, Coombe Women & Infants University Hospital, Dublin, Ireland
College of Business, University College Dublin, Dublin, Ireland
Dept. Mathematics and Statistics, University of Limerick, Limerick, Ireland
Hamilton Institute, Insight Centre for Data Analytics, Maynooth University, Kildare, Ireland


We have employed a novel statistical approach to estimate the conditional probability of developing OPSCCs given an oral HPV infection and the covariates age, sex, ethnicity and marital status in the US population. We recognise that this is at best a first-guess estimate of a natural history model of HPV-driven OPSCCs, within the existing limitations of the model.

HPV infections are now firmly established as the primary cause of cervical cancers, some anogenital cancers, and a subset of head and neck cancers [1][2][3]. Head and neck cancers represent a diverse group of cancers that develop from different anatomical sites: the oral cavity, oropharynx (tonsils, base of tongue), nasal cavity, nasopharynx, hypopharynx and larynx [4]. While the incidence of laryngeal, hypopharyngeal and oral cavity cancers has decreased over the last few decades owing to a decrease in smoking rates, a sharp increase in the incidence of oropharyngeal squamous cell carcinomas (OPSCCs) has been observed over the same period in several developed countries worldwide [5]. The increase in OPSCC incidence rates has been causally linked to the growing prevalence of HPV infections, presumably acquired via increased oral exposure to infected anogenital sites with changing sexual behaviour [6].

In the United States (US), the annual incidence rate of OPSCCs has now surpassed that of cervical cancer, with similar trends projected for other developed countries; there is a growing concern for managing disease

Unlike cervical cancer, OPSCCs are not associated with precursor lesions, but it is likely that, as in cervical cancer, a subclinical HPV infection that persists for decades precedes the development of these cancers.

The key parameters that govern the natural history of oral HPV infection (transition from infection to malignancy) remain largely ill-defined because they cannot be easily inferred from experimental data.

While several population-based studies have reported oral HPV infection prevalence rates of 4-7.0% in healthy populations, with risk factors similar to those reported for OPSCCs (including gender, sexual behaviour and current tobacco use), limited data exist on oral HPV persistence and clearance rates and their associated risk factors [10,11].

Unravelling the trajectory of oral HPV infections and associated risk factors that promote malignant transformation is crucial to the development of successful primary and secondary prevention strategies.

Mathematical models have previously been used successfully to estimate some of these ill-defined parameters in cervical cancer [12]. A comprehensive mathematical model incorporating data on oral HPV prevalence, persistence and/or clearance rates as well as demographics, lifestyle risk factors and sexual history will help to delineate the natural history of HPV in OPSCCs and associated risk factors.

Our aim is to estimate the conditional probability of developing OPSCCs following an oral HPV infection along with socio-demographic and other covariate information. To achieve this, we developed a novel modelling approach, which calibrates a machine learning method using inverse conditional probabilities.

We made use of the Bayesian Additive Regression Trees (BART) machine learning approach as it allows for complex non-linear interactions between variables and produces probabilistic confidence and prediction intervals. We stress that our approach is a proof-of-concept and outline some possible extensions.

The first dataset is from the Surveillance, Epidemiology, and End Results (SEER) program which provides [...] from the SEER data. This model is then inverted using Bayes' theorem to reverse the probability relationship. The inversion involves corrections using incidence values which are obtained from both SEER and NHANES data. We have made our code available at www.github.com/andrewcparnell/HNSCC for those wishing to confirm or extend our analysis. However, we note that the SEER oral HPV data are restricted and not included in the GitHub repository.

We define $C$ as the event that an individual has OPSCC and $H$ as the event that an individual has an oral HPV infection. We define $x_j$ to be the covariate value for an individual on covariate $j$, with $j = 1, \ldots, p$ covariates, and write $x$ to be the set of all covariates for an individual. In practice $p = 4$, with $x_1, \ldots, x_4$ representing, respectively, age, sex, marital status and race/ethnicity.

We use Bayes' theorem to invert the probabilities and obtain our desired goal $P(C \mid H, x)$ for an individual.

That is, the probability that an individual has OPSCC given that they have an oral HPV infection and

The key component of BART is that the parameters are set to be a sum of regression trees, i.e.:
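For orientation, the sum-of-trees structure in the standard BART formulation for a binary outcome is commonly written along the following lines. This is a sketch under assumptions: the symbols $\theta_i$ (the success probability for individual $i$), $h$ (a link function, probit in the original BART formulation and logit in some implementations), $K$ (the number of trees), $T_k$ (the structure of tree $k$) and $M_k$ (its terminal node values) are the conventional ones from the BART literature and are not notation confirmed by this paper.

```latex
\[
  h(\theta_i) \;=\; \sum_{k=1}^{K} g(x_i;\, T_k, M_k), \qquad i = 1, \ldots, n,
\]
% g(x_i; T_k, M_k) returns the terminal node value of tree k
% that the covariate vector x_i falls into.
```

Each tree contributes only a small part of the total fit, which is what allows the sum to capture non-linear effects and interactions between covariates.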

We subsequently summarise these probabilities to produce 95% posterior uncertainty intervals.
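The inversion and summary steps can be sketched in a few lines of Python. This is a minimal illustration, not the code from our repository: it writes C for OPSCC, H for oral HPV infection and x for the covariates (labels assumed here), applies Bayes' theorem in the form P(C | H, x) = P(H | C, x) P(C | x) / P(H | x) to each posterior draw, and takes equal-tailed quantiles. The function names, the Beta-distributed stand-in for the BART posterior draws, and the two incidence values are all hypothetical placeholders.

```python
import random


def invert(p_h_given_c, p_c, p_h):
    """Bayes' theorem: P(C | H, x) = P(H | C, x) * P(C | x) / P(H | x)."""
    return p_h_given_c * p_c / p_h


def interval(draws, level=0.95):
    """Equal-tailed interval covering `level` of a list of posterior draws."""
    s = sorted(draws)
    lo_idx = int((1 - level) / 2 * (len(s) - 1))
    hi_idx = int((1 + level) / 2 * (len(s) - 1))
    return s[lo_idx], s[hi_idx]


rng = random.Random(1)

# Hypothetical stand-in for posterior draws of P(H | C, x) for one
# covariate profile; in practice these would come from the fitted model.
draws_h_given_c = [rng.betavariate(60, 40) for _ in range(4000)]

P_C = 5e-5  # hypothetical OPSCC incidence for this profile (placeholder)
P_H = 0.05  # hypothetical oral HPV prevalence for this profile (placeholder)

# Invert every draw, then summarise the inverted draws.
draws_c_given_h = [invert(d, P_C, P_H) for d in draws_h_given_c]
lower, upper = interval(draws_c_given_h)
print(f"95% interval for P(C | H, x): ({lower:.2e}, {upper:.2e})")
```

Inverting draw by draw, rather than inverting a single point estimate, is what carries the model uncertainty through to the final interval.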

Study Population

The NHANES study population included 9,134 respondents with confirmed oral HPV data. [...] infection. However, we observe a substantial increase in the number of cases for women aged >60 and a decline in the number of OPSCC cases in males aged >60.

We have developed a novel mathematical modelling approach, a "double Bayes", that inverts the results

Despite the mathematical novelty of our approach, and in particular the ability to invert the association between OPSCC and HPV, we foresee several difficulties and issues with the results we present. The first [...] the time lag similar to HIV natural history models [19]. A second issue relates to the uncertainty estimates for the incidence of OPSCC or HPV. If these were available, we could propagate them through our inversion technique to produce more conservative uncertainty intervals. Currently we believe our results, which we stand over as a best first guess, underestimate the interval widths at any given confidence level.
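A toy simulation illustrates why treating the incidences as fixed is likely to understate the interval widths. It is a sketch under stated assumptions rather than part of our analysis: the Beta distributions, their parameters and the helper `interval_width` are hypothetical, chosen only so that the uncertain incidences have the same means as the fixed ones.

```python
import random


def interval_width(draws, level=0.95):
    """Width of the equal-tailed interval covering `level` of the draws."""
    s = sorted(draws)
    lo = s[int((1 - level) / 2 * (len(s) - 1))]
    hi = s[int((1 + level) / 2 * (len(s) - 1))]
    return hi - lo


rng = random.Random(42)
n_draws = 4000

# Hypothetical posterior draws of P(HPV | OPSCC, covariates) for one profile.
p_h_given_c = [rng.betavariate(60, 40) for _ in range(n_draws)]

# Case 1: incidences treated as known constants (placeholder values).
fixed = [p * 5e-5 / 0.05 for p in p_h_given_c]

# Case 2: incidences drawn from hypothetical Beta distributions whose
# means (5e-5 and 0.05) match the constants used in case 1.
uncertain = [
    p * rng.betavariate(5, 99995) / rng.betavariate(50, 950)
    for p in p_h_given_c
]

width_fixed = interval_width(fixed)
width_uncertain = interval_width(uncertain)
print(f"fixed incidences:     width = {width_fixed:.2e}")
print(f"uncertain incidences: width = {width_uncertain:.2e}")
```

In this toy setting the interval widens once the incidences carry their own uncertainty, which is the direction of the correction we would expect in the real analysis.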

A third limitation is the cross-categorisation of various covariates collected as part of the analysis. Between

We are grateful to the SEER HPV database for providing access to the restricted HPV data.

EK and ACP were paid for a portion of this work through HRB award ICE-2015-1037.