Density Ratio Estimation (DRE) is an important machine learning technique with many downstream applications. We consider the challenge of DRE with missing-not-at-random (MNAR) data. In this setting, we show that standard DRE methods lead to biased results, while our proposal (M-KLIEP), an adaptation of the popular DRE procedure KLIEP, restores consistency. Moreover, we provide finite-sample estimation error bounds for M-KLIEP, which demonstrate minimax optimality with respect to both sample size and worst-case missingness. We then adapt an important downstream application of DRE, Neyman-Pearson (NP) classification, to this MNAR setting. With high probability, our procedure both controls Type I error and achieves high power. Finally, we demonstrate promising empirical performance on both synthetic data and real-world data with simulated missingness.
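
For readers unfamiliar with the baseline, the following is a minimal sketch of standard KLIEP on fully observed data, not of M-KLIEP, whose missingness correction is not specified in this abstract. It assumes a Gaussian-kernel model for the ratio and a simple projected gradient ascent; the names `gaussian_kernel` and `kliep_fit` and all default parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel matrix between rows of x (n, d) and centers (m, d)."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def kliep_fit(x_nu, x_de, sigma=1.0, n_centers=30, lr=0.01, n_iter=500, seed=0):
    """Estimate r(x) = p_nu(x) / p_de(x) from fully observed samples.

    Model (an assumption of this sketch): r(x) = sum_l alpha_l K(x, c_l),
    with alpha >= 0 and kernel centers c_l drawn from the numerator sample.
    Objective: maximise the mean of log r over x_nu subject to the
    empirical mean of r over x_de being equal to 1.
    """
    rng = np.random.default_rng(seed)
    m = min(n_centers, len(x_nu))
    centers = x_nu[rng.choice(len(x_nu), size=m, replace=False)]

    A = gaussian_kernel(x_nu, centers, sigma)               # numerator features
    b = gaussian_kernel(x_de, centers, sigma).mean(axis=0)  # mean denominator features

    alpha = np.ones(m)
    alpha /= b @ alpha
    for _ in range(n_iter):
        # Gradient ascent step on the mean log-ratio over numerator samples.
        alpha = alpha + lr * (A.T @ (1.0 / (A @ alpha))) / len(x_nu)
        # Restore feasibility: normalisation constraint, then non-negativity.
        alpha = alpha + (1.0 - b @ alpha) * b / (b @ b)
        alpha = np.maximum(alpha, 0.0)
        alpha = alpha / (b @ alpha)

    return lambda x: gaussian_kernel(x, centers, sigma) @ alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_nu = rng.normal(1.0, 1.0, size=(200, 1))  # numerator sample
    x_de = rng.normal(0.0, 1.0, size=(200, 1))  # denominator sample
    ratio = kliep_fit(x_nu, x_de)
    print(ratio(np.array([[-2.0], [0.0], [2.0]])))
```

The sketch treats every sample as fully observed. Under MNAR missingness, applying it only to the observed rows is the kind of standard approach the abstract describes as biased.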