Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modeling. Constructing a predictive model can be thought of as learning a prediction function, i.e., a function that takes covariate data as input and outputs a predicted value. Many strategies for learning these functions from data are available, from parametric regressions to machine learning algorithms. Choosing an approach can be challenging, as it is impossible to know in advance which one is most suitable for the dataset and prediction task at hand. The super learner (SL) is an algorithm that alleviates concerns over selecting the one "right" strategy while providing the freedom to consider many of them, such as those recommended by collaborators, used in related research, or specified by subject-matter experts. It is an entirely pre-specified and data-adaptive strategy for predictive modeling. To ensure the SL is well-specified for learning the prediction function, the analyst does need to make a few important choices. In this Education Corner article, we provide step-by-step guidelines for making these choices, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience and guided by theory.
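To make the idea of learning a prediction function from a library of candidate strategies concrete, the following is a minimal sketch of a discrete super learner, which selects the candidate with the lowest cross-validated risk. It is an illustration under stated assumptions, not the article's specification: the two candidate learners, the simulated data, the squared-error loss, the five-fold scheme, and all function names are hypothetical choices made for this example.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Hypothetical candidate 1: ignore the covariate, predict the training mean."""
    mu = mean(ys)
    return lambda x: mu


def fit_linear(xs, ys):
    """Hypothetical candidate 2: simple least-squares line y = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = y_bar - b * x_bar
    return lambda x: a + b * x


def cv_risk(fit, xs, ys, n_folds=5):
    """Cross-validated mean squared error of one candidate learner."""
    n = len(xs)
    sq_errors = []
    for k in range(n_folds):
        # Every n_folds-th observation, starting at k, forms the validation fold.
        valid = set(range(k, n, n_folds))
        train_x = [xs[i] for i in range(n) if i not in valid]
        train_y = [ys[i] for i in range(n) if i not in valid]
        predict = fit(train_x, train_y)
        sq_errors.extend((ys[i] - predict(xs[i])) ** 2 for i in valid)
    return mean(sq_errors)


def discrete_super_learner(library, xs, ys, n_folds=5):
    """Pick the candidate with the lowest cross-validated risk, refit on all data."""
    risks = {name: cv_risk(fit, xs, ys, n_folds) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, library[best](xs, ys), risks


# Simulated data (an assumption for illustration): a linear signal plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

library = {"mean": fit_mean, "linear": fit_linear}
best_name, predict, risks = discrete_super_learner(library, xs, ys)
print(best_name, {name: round(r, 2) for name, r in risks.items()})
```

The library is fixed before looking at the results, which is what makes the procedure pre-specified, while the selection itself is driven by the data. The ensemble form of the super learner goes one step further and combines the candidates with weights estimated from their cross-validated predictions, rather than selecting a single one.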