Healthcare datasets obtained from Electronic Health Records have proven extremely useful for assessing associations between patients' predictors and outcomes of interest. However, these datasets often have missing values in a high proportion of cases, and simply removing those cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been proposed to attempt to recover the missing information. Each algorithm has strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, selecting each algorithm's parameters and making data-related modelling choices are both crucial and challenging. In this paper, we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, in which we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. The experiments presented here show that our approach can effectively highlight the most valid and best-performing missing-data handling strategy for our case study. Moreover, our methodology allowed us to understand the behavior of the different models and how it changed as we modified their parameters. Our method is general and can be applied to different research fields and to datasets containing heterogeneous data types.