Machine learning (ML) approaches to data analysis are now widely adopted in many fields, including epidemiology and medicine. Before applying ML, confounds are commonly removed featurewise by regressing out their variance with linear regression. Here, we show that this common approach to confound removal biases ML models, leading to misleading results. Specifically, it can leak information such that null or moderate effects are amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset in which accuracy for predicting attention deficit hyperactivity disorder (ADHD) is overestimated when depression is used as a confound. Our results have wide-reaching implications for the implementation and deployment of ML workflows and caution against naïve use of standard confound removal approaches.
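The featurewise confound removal step the abstract refers to can be illustrated with a minimal NumPy sketch. This is an illustrative example, not the authors' code: each feature is regressed on the confound (plus an intercept) by ordinary least squares, and the residuals are kept as the "deconfounded" features. All variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
confound = rng.normal(size=(n, 1))        # e.g. a depression score
X = rng.normal(size=(n, 5))               # features
y = (confound[:, 0] > 0).astype(int)      # target driven only by the confound

# Featurewise confound removal: regress each feature on the confound
# and keep the residuals (OLS via the pseudoinverse).
C = np.hstack([np.ones((n, 1)), confound])  # design matrix with intercept
beta = np.linalg.pinv(C) @ X                # one coefficient column per feature
X_deconfounded = X - C @ beta               # residualized features
```

The residuals are orthogonal to the confound by construction; the paper's point is that nonlinear models applied to such residuals can nonetheless recover confound information, so the deconfounded features must not be treated as confound-free.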