The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can prevent similar datasets from being analyzed in the same study, precluding many existing datasets from being used for new analyses. While an extensive collection of packages and algorithms has been developed for data imputation, the overwhelming majority perform poorly when there are many missing values and the sample size is small, which are unfortunately common characteristics of empirical data. Such low-accuracy estimates adversely affect the performance of downstream statistical models. We develop a statistical inference framework for predicting the target variable without imputing missing values. Our framework, RIFLE (Robust InFerence via Low-order moment Estimations), estimates low-order moments with corresponding confidence intervals to learn a distributionally robust model. We specialize our framework to linear regression and normal discriminant analysis, and we provide convergence and performance guarantees. This framework can also be adapted to impute missing data. In numerical experiments, we compare RIFLE with state-of-the-art approaches (including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer). Our experiments demonstrate that RIFLE outperforms the other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small. RIFLE is publicly available.
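To make the idea of learning from low-order moments rather than imputed values concrete, the following is a minimal sketch, not RIFLE itself. It estimates second-order moments from whichever entries happen to be observed for each pair of variables and then solves a ridge-regularized linear regression from those estimates. The function names `pairwise_moments` and `fit_linear_from_moments`, the `ridge` parameter, and the assumption that values are missing completely at random are all illustrative assumptions. The confidence intervals and the distributionally robust formulation described in the abstract are deliberately omitted.

```python
import numpy as np


def pairwise_moments(Z):
    """Estimate E[z_i * z_j] using only rows where both z_i and z_j are observed.

    Z is an (n, d) array with np.nan marking missing entries.
    Returns the (d, d) second-moment estimate and the (d, d) pair counts.
    """
    mask = ~np.isnan(Z)
    Z0 = np.where(mask, Z, 0.0)       # zero out missing entries
    M = mask.astype(float)
    counts = M.T @ M                  # rows where both i and j are observed
    sums = Z0.T @ Z0                  # sum of z_i * z_j over those rows
    second = sums / np.maximum(counts, 1.0)
    return second, counts


def fit_linear_from_moments(X, y, ridge=1e-3):
    """Fit y ~ theta[0] + X @ theta[1:] from moment estimates, without imputation.

    Hypothetical helper: a plug-in estimator, not the robust model in the paper.
    """
    n = len(y)
    # Prepend a constant column so the intercept is handled like any feature.
    Z = np.column_stack([np.ones(n), X, y])
    second, _ = pairwise_moments(Z)
    d = X.shape[1] + 1
    C = second[:d, :d]                # estimate of E[x x^T] (with intercept)
    b = second[:d, d]                 # estimate of E[x y]
    # The small ridge term guards against an ill-conditioned moment matrix.
    return np.linalg.solve(C + ridge * np.eye(d), b)
```

A pairwise moment matrix built this way is not guaranteed to be positive semidefinite, which is one reason the ridge term is included here. Accounting for the uncertainty in each moment estimate, as the abstract describes, is a more principled way to handle the same issue.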