Missing data remains a major barrier to data analysis across numerous applications. Recently, deep generative models have been used to impute missing data, motivated by their ability to capture highly non-linear and complex relationships in the data. In this work, we investigate the ability of deep models, namely variational autoencoders (VAEs), to account for uncertainty in missing data through multiple imputation strategies. We find that VAEs provide poor empirical coverage of missing data, with underestimated uncertainty and overconfident imputations, particularly for more extreme missing data values. To overcome this, we employ $\beta$-VAEs, which, viewed from a generalized Bayes framework, provide robustness to model misspecification. Assigning a good value of $\beta$ is critical for uncertainty calibration, and we demonstrate how this can be achieved using cross-validation. In downstream tasks, we show how multiple imputation with $\beta$-VAEs can avoid false discoveries that arise as artefacts of imputation.
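To make the notion of empirical coverage concrete, the following is a minimal sketch and not the evaluation code used in this work. It assumes that each missing entry has a held-out true value and a set of M imputed draws, and it reports the fraction of true values that fall inside the central interval of their draws. The function name `empirical_coverage`, the nearest-rank quantile rule, and the synthetic Gaussian draws standing in for VAE imputations are all illustrative assumptions.

```python
import math
import random


def empirical_coverage(true_values, imputations, level=0.95):
    """Fraction of held-out true values inside the central `level` interval
    formed from the imputed draws of each missing entry.

    true_values: list of floats, one per missing entry.
    imputations: list of lists; imputations[i] holds the M draws for entry i.
    """
    alpha = (1.0 - level) / 2.0
    hits = 0
    for truth, draws in zip(true_values, imputations):
        ordered = sorted(draws)
        m = len(ordered)
        # Nearest-rank empirical quantiles, rounded outwards (illustrative choice).
        lower = ordered[math.floor(alpha * (m - 1))]
        upper = ordered[math.ceil((1.0 - alpha) * (m - 1))]
        hits += lower <= truth <= upper
    return hits / len(true_values)


# Synthetic stand-in for imputations (assumption: no VAE is trained here).
rng = random.Random(0)
n_missing, n_draws = 2000, 200
truth = [rng.gauss(0.0, 1.0) for _ in range(n_missing)]

# Draws whose spread matches the spread of the true values.
calibrated = [[rng.gauss(0.0, 1.0) for _ in range(n_draws)]
              for _ in range(n_missing)]
# Draws that are too narrow, mimicking overconfident imputations.
overconfident = [[rng.gauss(0.0, 0.3) for _ in range(n_draws)]
                 for _ in range(n_missing)]

cov_calibrated = empirical_coverage(truth, calibrated)
cov_overconfident = empirical_coverage(truth, overconfident)
print("calibrated coverage:   ", cov_calibrated)
print("overconfident coverage:", cov_overconfident)
```

Under these assumptions, draws that are too narrow yield coverage well below the nominal 95% level, which is the failure mode described above. In a cross-validation setting, the same quantity could be computed on artificially masked entries for each candidate value of $\beta$.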