We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. A major approach to this objective is to train a model that integrates the information of the different modalities into a joint representation and then generates one modality from another via this joint representation. We first apply this approach straightforwardly to variational autoencoders (VAEs), yielding what we call a joint multimodal variational autoencoder (JMVAE). However, we find that when this model attempts to generate a high-dimensional modality that is missing from the input, the joint representation collapses and the modality cannot be generated successfully. Furthermore, we confirm that this difficulty is not resolved even by a known solution. Therefore, in this study, we propose two models to prevent this difficulty: JMVAE-kl and JMVAE-h. Our experiments demonstrate that these methods prevent the difficulty above and generate modalities bi-directionally with likelihoods equal to or higher than those of conventional VAE methods, which generate in only one direction. Moreover, we confirm that these methods obtain the joint representation appropriately, so that they can generate diverse variations of a modality by traversing the joint representation or by changing the value of the other modality.
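To make the joint-representation architecture concrete, the following is a minimal sketch of a JMVAE-style joint VAE in PyTorch, assuming two vector-valued modalities x and w. The layer sizes, the Gaussian reconstruction losses, and all class and function names here are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of a joint multimodal VAE (JMVAE-style), under the
# assumptions stated above; not the authors' reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointVAE(nn.Module):
    def __init__(self, x_dim, w_dim, z_dim=64, h_dim=512):
        super().__init__()
        # Joint encoder q(z | x, w): both modalities feed one latent.
        self.enc_joint = nn.Sequential(
            nn.Linear(x_dim + w_dim, h_dim), nn.ReLU(),
        )
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        # One decoder per modality: p(x | z) and p(w | z).
        self.dec_x = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                   nn.Linear(h_dim, x_dim))
        self.dec_w = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                   nn.Linear(h_dim, w_dim))

    def encode(self, x, w):
        h = self.enc_joint(torch.cat([x, w], dim=-1))
        return self.mu(h), self.logvar(h)

    def forward(self, x, w):
        mu, logvar = self.encode(x, w)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec_x(z), self.dec_w(z), mu, logvar

def elbo_loss(x, w, x_rec, w_rec, mu, logvar):
    # Reconstruction terms for both modalities plus the KL to N(0, I);
    # Gaussian (MSE) likelihoods are an illustrative choice.
    rec = (F.mse_loss(x_rec, x, reduction="sum")
           + F.mse_loss(w_rec, w, reduction="sum"))
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

In this sketch, both modalities are reconstructed from a single latent z drawn from the joint posterior; a JMVAE-kl variant would additionally train unimodal encoders q(z|x) and q(z|w) and penalize their divergence from the joint posterior, so that a modality missing at test time can still be inferred from the other.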