We develop a fully non-parametric, fast, easy-to-use, and powerful test for the missing completely at random (MCAR) assumption on the missingness mechanism of a data set. The test compares the distributions of different missingness patterns on random projections in the variable space of the data. The distributional differences are measured with the Kullback-Leibler divergence, estimated using probability Random Forests. We thus refer to it as the "Projected Kullback-Leibler MCAR" (PKLM) test. The use of random projections makes it applicable even if few or no fully observed observations are available or if the number of dimensions is large. An efficient permutation approach guarantees that the test holds its nominal level for any finite sample size, resolving a major shortcoming of most other available tests. Moreover, the test can be used on both discrete and continuous data. We show empirically on a range of simulated data distributions and real data sets that our test has consistently high power and avoids inflated type I errors. Finally, we provide an R package \texttt{PKLMtest} with an implementation of our test.
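To illustrate why a permutation approach can control the level at any sample size, here is a minimal Python sketch. It is not the PKLM implementation: the statistic `mean_gap` is a deliberately simple stand-in for the projected Kullback-Leibler statistic, and the function names, the `n_perm` default, and the toy data are all assumptions made for illustration. The only idea carried over is the general one: under MCAR the missingness labels are exchangeable, so recomputing the statistic on shuffled labels gives a valid reference distribution.

```python
import random


def mean_gap(values, is_missing):
    """Stand-in statistic (an assumption, not the PKLM statistic):
    absolute difference in the mean of an observed variable between
    rows where another variable is missing and rows where it is not."""
    miss = [v for v, m in zip(values, is_missing) if m]
    obs = [v for v, m in zip(values, is_missing) if not m]
    return abs(sum(miss) / len(miss) - sum(obs) / len(obs))


def permutation_p_value(values, is_missing, statistic, n_perm=199, seed=0):
    """Permutation p-value for a statistic that is large under departures
    from MCAR. Both groups (missing / not missing) must be non-empty."""
    rng = random.Random(seed)
    observed = statistic(values, is_missing)
    labels = list(is_missing)
    exceed = 0
    for _ in range(n_perm):
        # Shuffling keeps the group sizes fixed and breaks any link
        # between the values and the missingness labels.
        rng.shuffle(labels)
        if statistic(values, labels) >= observed:
            exceed += 1
    # The "+1" terms count the observed statistic as one of the
    # permutations, which keeps the p-value valid in finite samples.
    return (1 + exceed) / (1 + n_perm)


# Toy data: the second variable is missing exactly when the first is large,
# so missingness depends on observed values and MCAR is violated.
values = list(range(20))
is_missing = [v >= 10 for v in values]
p = permutation_p_value(values, is_missing, mean_gap)
print(p)
```

With `n_perm` permutations the smallest attainable p-value is `1 / (n_perm + 1)`, so the number of permutations bounds how small a significance level the test can resolve.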