We develop a fully non-parametric, easy-to-use, and powerful test for the missing completely at random (MCAR) assumption on the missingness mechanism of a dataset. The test compares the distributions of different missingness patterns on random projections in the variable space of the data. The distributional differences are measured with the Kullback-Leibler divergence, estimated using probability random forests. We thus refer to it as the "Projected Kullback-Leibler MCAR" (PKLM) test. The use of random projections makes it applicable even if very few or no complete observations are available, or if the number of dimensions is large. An efficient permutation approach guarantees the level for any finite sample size, resolving a major shortcoming of most other available tests. Moreover, the test can be used on both discrete and continuous data. We show empirically, on a range of simulated data distributions and real datasets, that our test has consistently high power and avoids inflated type-I errors. Finally, we provide an R package, PKLMtest, with an implementation of our test.
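To illustrate how a permutation approach can control the level at any finite sample size, the following Python sketch computes a permutation p-value for a toy MCAR check. It is not the PKLM statistic: the random-forest estimate of the Kullback-Leibler divergence on random projections is replaced here by a simple stand-in (the absolute difference in means of a fully observed variable between rows with and without a missing entry). The function names `mean_gap`, `permutation_pvalue`, and `simulate` are hypothetical and chosen only for this example.

```python
import random
from statistics import mean


def mean_gap(values, is_missing):
    """Stand-in statistic (an assumption, not the PKLM statistic):
    |mean(values | missing) - mean(values | observed)|."""
    miss = [v for v, m in zip(values, is_missing) if m]
    obs = [v for v, m in zip(values, is_missing) if not m]
    if not miss or not obs:
        return 0.0  # only one pattern present, nothing to compare
    return abs(mean(miss) - mean(obs))


def permutation_pvalue(values, is_missing, n_perm=999, seed=0):
    """Permutation p-value: shuffle the missingness labels across rows,
    recompute the statistic, and count how often it reaches the observed value."""
    rng = random.Random(seed)
    observed = mean_gap(values, is_missing)
    labels = list(is_missing)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if mean_gap(values, labels) >= observed:
            exceed += 1
    # The "+1" terms count the observed statistic as one of the permutations.
    return (1 + exceed) / (n_perm + 1)


def simulate(n, mcar, seed):
    """Toy data: x is fully observed; a second variable is missing either
    independently of x (MCAR) or whenever x exceeds 0.5 (not MCAR)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if mcar:
        miss = [rng.random() < 0.3 for _ in x]
    else:
        miss = [xi > 0.5 for xi in x]
    return x, miss


x_mcar, m_mcar = simulate(200, mcar=True, seed=1)
x_dep, m_dep = simulate(200, mcar=False, seed=2)
p_mcar = permutation_pvalue(x_mcar, m_mcar)
p_dep = permutation_pvalue(x_dep, m_dep)
print("p-value under MCAR:", p_mcar)
print("p-value under dependent missingness:", p_dep)
```

Under MCAR the missingness labels are exchangeable across rows, so the observed statistic is just one draw among the permuted ones. This is why rejecting when the p-value is at most a chosen level keeps the type-I error at or below that level without any asymptotic argument. In the actual test, the stand-in statistic would be replaced by the projected Kullback-Leibler statistic described above.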