Machine learning is increasingly used to discover diagnostic and prognostic biomarkers from high-dimensional molecular data. However, a variety of factors related to experimental design may affect the ability to learn generalizable and clinically applicable diagnostics. Here, we argue that a causal perspective improves the identification of these challenges and formalizes their relation to the robustness and generalization of machine learning-based diagnostics. To make the discussion concrete, we focus on a specific, recently established high-dimensional biomarker: adaptive immune receptor repertoires (AIRRs). We discuss how the main biological and experimental factors of the AIRR domain may influence the learned biomarkers, and we provide easily adjustable simulations of such effects. In conclusion, we find that causal modeling improves the robustness of machine learning-based biomarkers by identifying stable relations between variables and by guiding the adjustment of the relations and variables that vary between populations.