Missing data is common in applied data science, particularly for tabular data sets found in healthcare, social sciences, and natural sciences. Most supervised learning methods work only on complete data, thus requiring preprocessing, such as missing value imputation, to work on incomplete data sets. However, imputation discards potentially useful information encoded by the pattern of missing values. For data sets with informative missing patterns, the Missing Indicator Method (MIM), which adds indicator variables that encode the missing pattern, can be used in conjunction with imputation to improve model performance. We show experimentally that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models asymptotically for uninformative missing values. Nonetheless, MIM can increase variance if many of the added indicators are uninformative, causing harm particularly for high-dimensional data sets. To address this issue, we introduce Selective MIM (SMIM), a method that adds missing indicators only for features that have informative missing patterns. We show empirically that SMIM performs at least as well as MIM across a range of experimental settings, and improves on MIM for high-dimensional data.
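As a minimal illustration of the basic idea behind MIM, the sketch below mean-imputes a small table and appends one 0/1 indicator column per feature that contains missing values. The function name `mean_impute_with_indicators`, the choice of mean imputation, and the toy data are assumptions made for this example, not details taken from the paper. The selection step that distinguishes SMIM is not shown, because the abstract does not specify it.

```python
import math


def mean_impute_with_indicators(rows):
    """Mean-impute each column and append 0/1 missing indicators.

    Hypothetical helper for illustration: missing entries are None or NaN,
    and an indicator column is added only for columns that actually
    contain at least one missing value.
    """
    n_cols = len(rows[0])

    # Boolean mask of missing entries, same shape as the input.
    is_missing = [
        [v is None or (isinstance(v, float) and math.isnan(v)) for v in row]
        for row in rows
    ]

    # Column means computed over observed values only.
    means = []
    for j in range(n_cols):
        observed = [row[j] for row, m in zip(rows, is_missing) if not m[j]]
        means.append(sum(observed) / len(observed) if observed else 0.0)

    # Columns that get an indicator: those with any missing value.
    cols_with_missing = [
        j for j in range(n_cols) if any(m[j] for m in is_missing)
    ]

    augmented = []
    for row, m in zip(rows, is_missing):
        imputed = [means[j] if m[j] else float(row[j]) for j in range(n_cols)]
        indicators = [1.0 if m[j] else 0.0 for j in cols_with_missing]
        augmented.append(imputed + indicators)
    return augmented


# Toy data: the first feature is missing in the second row.
X = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, 30.0],
]
X_mim = mean_impute_with_indicators(X)
```

The augmented matrix can then be passed to any supervised learner, which may use the indicator column when missingness itself carries signal. In practice, scikit-learn's `SimpleImputer` offers an `add_indicator=True` option that yields a comparable augmented matrix.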