Machine learning is increasingly used to discover diagnostic and prognostic biomarkers from high-dimensional molecular data. However, a variety of factors related to experimental design may affect the ability to learn generalizable and clinically applicable diagnostics. Here, we argue that a causal perspective improves the identification of these challenges and formalizes their relation to the robustness and generalization of machine learning-based diagnostics. To make the discussion concrete, we focus on a specific, recently established high-dimensional biomarker: adaptive immune receptor repertoires (AIRRs). Through simulations, we illustrate how major biological and experimental factors of the AIRR domain may influence the learned biomarkers. In conclusion, we argue that causal modeling improves the robustness of machine learning-based biomarkers by identifying stable relations between variables and by guiding the adjustment of the relations and variables that vary between populations.