Real-world datasets often contain missing values arising from complex generative processes, where the cause of the missingness may not be fully observed. This is known as missing not at random (MNAR) data. However, many imputation methods do not account for the missingness mechanism, resulting in biased imputation values when MNAR data are present. Although a few methods do consider the MNAR scenario, their models' identifiability under MNAR is generally not guaranteed. That is, the model parameters cannot be uniquely determined even with infinitely many data samples, so the imputation results given by such models can still be biased. This issue is especially overlooked by many modern deep generative models. In this work, we fill this gap by systematically analyzing the identifiability of generative models under MNAR. Furthermore, we propose a practical deep generative model that can provide identifiability guarantees under mild assumptions for a wide range of MNAR mechanisms. Our method demonstrates a clear advantage on tasks with both synthetic data and multiple real-world MNAR scenarios.
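As a minimal illustration of the bias claim (this toy simulation is not the proposed model; the self-masking mechanism and all parameters are assumptions made purely for demonstration), consider a variable whose larger values are more likely to be missing. Any imputation that ignores this mechanism, such as mean imputation, inherits the bias of the observed values:

```python
# Toy MNAR simulation: P(missing) depends on the value of x itself (self-masking),
# so statistics computed from the observed entries are biased, and mechanism-agnostic
# imputation (here: mean imputation) cannot correct that bias.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(loc=0.0, scale=1.0, size=n)      # complete data, true mean = 0
p_missing = 1.0 / (1.0 + np.exp(-2.0 * x))      # MNAR: larger x -> more likely missing
missing = rng.random(n) < p_missing

observed = x[~missing]
print(f"true mean of x:             {x.mean():+.3f}")         # approximately 0
print(f"mean of observed values:    {observed.mean():+.3f}")  # clearly negative (biased)

# Mean imputation ignores the missingness mechanism, so the "completed"
# dataset remains biased toward the preferentially observed (small) values.
x_imputed = np.where(missing, observed.mean(), x)
print(f"mean after mean imputation: {x_imputed.mean():+.3f}")
```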