Missing values are unavoidable when working with data, and they become more frequent as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss or biased analyses. Since the seminal work of Rubin (1976), there has been a burgeoning literature on missing values with heterogeneous aims and motivations. This has resulted in the development of various methods, formalizations, and tools (including a large number of R packages and Python modules). For practitioners, however, it remains challenging to decide which method is best suited to their problem, partly because handling missing data is still not systematically covered in statistics or data science curricula. To help address this challenge, we have launched a unified platform, "R-miss-tastic", which aims to provide an overview of standard missing-values problems, methods for handling them in analyses, and relevant implementations of these methods. In the same spirit, we have also developed several pipelines in R and Python that give a hands-on illustration of how to handle missing values in various statistical tasks, such as estimation and prediction, while ensuring reproducibility of the analyses. We hope this will also provide some guidance on which method to choose for a specific problem and dataset. The objective of this work is not only to organize materials comprehensively, but also to create standardized analysis workflows and to provide a common ground for discussion within the community. The platform is thus suited to beginners, students, more advanced analysts, and researchers.