As we gain access to a greater depth and range of health-related information about individuals, three questions arise: (1) Can we build better models to predict individual-level risk of ill health? (2) How much data do we need to predict ill health effectively? (3) Are new methods required to handle the added complexity that new forms of data bring? This study applies a machine learning approach to identify the relative contribution of personal, social, health-related, biomarker and genetic data as predictors of individuals' future health. Using longitudinal data on 6830 individuals in the UK from Understanding Society (2010-12 to 2015-17), the study compares the predictive performance of five types of measures: personal (e.g. age, sex), social (e.g. occupation, education), health-related (e.g. body weight, grip strength), biomarker (e.g. cholesterol, hormones) and genetic (single nucleotide polymorphisms, SNPs). The outcome variable was limiting long-term illness one and five years from baseline. Two machine learning approaches were used to build predictive models: deep learning via neural networks, and XGBoost (gradient-boosted decision trees). Their performance was compared with that of traditional logistic regression models. Health-related measures were the strongest predictors of future health status, while genetic data performed poorly. Machine learning models offered only marginal improvements in accuracy over logistic regression, although they performed well on other metrics: neural networks were best on AUC and XGBoost on precision. The study suggests that increasing the complexity of data and methods does not necessarily translate into improved understanding of the determinants of health or better performance of predictive models of ill health.
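The observation that different models can lead on different metrics (accuracy, precision, AUC) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the labels, the two sets of scores, the 0.5 threshold and the names `model_a` and `model_b` are invented toy values, not data or results from the study, and the metric functions are plain textbook definitions rather than the authors' evaluation code.

```python
# Toy illustration only: labels and scores are made up, not study data.

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

def auc(y_true, scores):
    """AUC as the probability that a random positive case is scored
    above a random negative case (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluate(y_true, scores, threshold=0.5):
    """Threshold the scores for accuracy/precision; AUC uses raw scores."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return {
        "accuracy": accuracy(y_true, y_pred),
        "precision": precision(y_true, y_pred),
        "auc": auc(y_true, scores),
    }

# 1 = reports a limiting long-term illness at follow-up (hypothetical labels)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]

# Hypothetical predicted risks from two models for the same eight people
model_a = [0.9, 0.8, 0.45, 0.6, 0.3, 0.2, 0.1, 0.05]
model_b = [0.9, 0.8, 0.1, 0.4, 0.3, 0.2, 0.15, 0.05]

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    print(name, evaluate(y_true, scores))
```

In this toy case `model_a` ranks cases better overall (higher AUC), while `model_b` makes no false-positive calls at the chosen threshold (higher precision), so the preferred model depends on which metric matters for the application.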