Real-world datasets often contain entries with missing elements; for example, in a medical dataset, a patient is unlikely to have taken every possible diagnostic test. Variational Autoencoders (VAEs) are popular generative models often used for unsupervised learning. Despite their widespread use, it is unclear how best to apply VAEs to datasets with missing data. We develop a novel latent variable model of a corruption process that generates missing data, and derive a corresponding tractable evidence lower bound (ELBO). Our model is straightforward to implement, handles both missing completely at random (MCAR) and missing not at random (MNAR) data, scales to high-dimensional inputs, and gives both the VAE encoder and decoder principled access to indicator variables for whether each data element is missing. On the MNIST and SVHN datasets, we demonstrate improved marginal log-likelihood of the observed data and better missing-data imputation compared to existing approaches.
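To make the MCAR/MNAR distinction and the role of the missingness indicators concrete, the following is a minimal sketch in plain Python. It is not the corruption-process model or the ELBO derived in this work. It only illustrates, under stated assumptions, how a mask can be drawn independently of the data (MCAR) or depending on the data (MNAR), how the indicators can be passed alongside the corrupted input, and how a reconstruction term can be restricted to observed elements. All function names, the zero fill value, the `scale` parameter, and the Bernoulli likelihood are hypothetical choices made for illustration.

```python
import math
import random


def mcar_mask(x, p_missing, rng):
    """MCAR: every element is dropped with the same probability,
    independent of its value. True means 'observed'."""
    return [rng.random() >= p_missing for _ in x]


def mnar_mask(x, rng, scale=0.8):
    """MNAR (illustrative assumption): the probability of dropping an
    element grows with its own value, so missingness depends on the data."""
    return [rng.random() >= scale * min(max(v, 0.0), 1.0) for v in x]


def corrupt(x, mask, fill=0.0):
    """Replace missing elements with a fill value and append the
    indicator variables, so a network receiving this vector can tell a
    true zero from a missing entry. Zero-fill is an assumption."""
    x_tilde = [v if m else fill for v, m in zip(x, mask)]
    indicators = [1.0 if m else 0.0 for m in mask]
    return x_tilde + indicators


def masked_bernoulli_log_lik(x, probs, mask, eps=1e-7):
    """Bernoulli log-likelihood summed over observed elements only."""
    total = 0.0
    for v, p, m in zip(x, probs, mask):
        if m:
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += v * math.log(p) + (1.0 - v) * math.log(1.0 - p)
    return total


rng = random.Random(0)
x = [0.0, 1.0, 1.0, 0.0]               # a tiny binarised "image"
mask = mnar_mask(x, rng)                # bright pixels are more likely to be dropped
encoder_input = corrupt(x, mask)        # length 2 * len(x): values, then indicators
decoder_probs = [0.5] * len(x)          # stand-in for a decoder's output
recon_term = masked_bernoulli_log_lik(x, decoder_probs, mask)
```

In this sketch the mask enters only as an extra input and as a weight on the reconstruction term; treating the mask as a random variable generated by an explicit corruption process, as the abstract describes, is a separate modelling step that is not shown here.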