Background: Existing guidelines for handling missing data are generally not consistent with the goals of prediction modelling, where missing data can occur at any stage of the model pipeline. Multiple imputation (MI), often heralded as the gold standard approach, can be challenging to apply in the clinic; in particular, the outcome is unknown at prediction time and so cannot be used to impute missing predictors. Regression imputation (RI) may offer a pragmatic alternative in the prediction context that is simpler to apply in the clinic. Moreover, missing indicators can handle informative missingness, but it is currently unknown how well they perform within clinical prediction models (CPMs). Methods: We performed a simulation study in which data were generated under various missing data mechanisms to compare the predictive performance of CPMs developed using both imputation methods. We considered deployment scenarios where missing data are permitted or prohibited, and developed models that use or omit the outcome during imputation and that include or omit missing indicators. Results: When complete data must be available at deployment, our findings were in line with widely used recommendations: the outcome should be used to impute development data under MI, yet omitted under RI. When imputation is applied at deployment, omitting the outcome from the imputation at development was preferred. Missing indicators improved model performance in some specific cases, but can be harmful when missingness depends on the outcome. Conclusion: We provide evidence that commonly taught principles of handling missing data via MI may not apply to CPMs, particularly when data can be missing at deployment. In such settings, RI and missing indicator methods can (marginally) outperform MI. As shown, the performance of the missing data handling method must be evaluated on a study-by-study basis, and the choice of method should be based on whether missing data are allowed at deployment.
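To make the deployment argument concrete, here is a minimal sketch of regression imputation that omits the outcome and adds a missing indicator. It is an illustration under stated assumptions, not the simulation code used in the study: the predictor names `x1` and `x2`, the helper functions, and the toy values are all hypothetical, and a simple linear regression of `x2` on `x1` stands in for whatever imputation model would be used in practice.

```python
from statistics import mean


def fit_regression_imputer(x1, x2):
    """Fit x2 ~ intercept + slope * x1 by least squares on complete cases.

    The outcome is deliberately not used, so the fitted imputer can be
    reused unchanged at prediction time.
    """
    pairs = [(u, v) for u, v in zip(x1, x2) if v is not None]
    mu1 = mean(u for u, _ in pairs)
    mu2 = mean(v for _, v in pairs)
    sxx = sum((u - mu1) ** 2 for u, _ in pairs)
    sxy = sum((u - mu1) * (v - mu2) for u, v in pairs)
    slope = sxy / sxx
    intercept = mu2 - slope * mu1
    return intercept, slope


def impute_with_indicator(imputer, x1_value, x2_value):
    """Return (x2 value, missing indicator) for a single record."""
    intercept, slope = imputer
    if x2_value is None:
        # Deterministic regression imputation; indicator flags the gap.
        return intercept + slope * x1_value, 1
    return x2_value, 0


# Hypothetical development data: x2 is missing for two records.
dev_x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
dev_x2 = [2.0, 4.0, None, 8.0, None]

imputer = fit_regression_imputer(dev_x1, dev_x2)
dev_filled = [
    impute_with_indicator(imputer, u, v) for u, v in zip(dev_x1, dev_x2)
]

# At deployment the same imputer handles a new patient with x2 missing,
# without needing the (unknown) outcome.
new_patient = impute_with_indicator(imputer, 6.0, None)
```

A prediction model developed on `x1`, the imputed `x2`, and the indicator can then be applied to `new_patient` directly. Whether the indicator helps or harms depends on the missingness mechanism, as the results above note.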