A number of variational autoencoders (VAEs) have recently emerged that aim to model multimodal data, e.g., to jointly model images and their corresponding captions. However, multimodal VAEs tend to focus solely on a subset of the modalities, e.g., fitting the image while neglecting the caption. We refer to this limitation as modality collapse. In this work, we argue that this effect is a consequence of conflicting gradients during multimodal VAE training. We show how to detect the sub-graphs of the computational graph in which gradients conflict (impartiality blocks), and how to leverage existing gradient-conflict solutions from multitask learning to mitigate modality collapse, i.e., to ensure impartial optimization across modalities. We apply our training framework to several multimodal VAE models, losses, and datasets from the literature, and empirically show that it significantly improves reconstruction performance, conditional generation, and coherence of the latent space across modalities.
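The gradient-conflict mechanism the abstract alludes to can be sketched in a few lines. The snippet below is an illustrative example, not the paper's implementation: it detects a conflict between two per-modality gradients on a shared parameter block (negative dot product) and resolves it with a PCGrad-style projection, one of the gradient-conflict solutions from the multitask-learning literature.

```python
import numpy as np

def conflicts(g1, g2):
    # Two gradients conflict when their dot product is negative,
    # i.e. a step that decreases one modality's loss increases the other's.
    return float(np.dot(g1, g2)) < 0.0

def pcgrad_project(g1, g2):
    # PCGrad-style resolution: if g1 conflicts with g2, remove from g1
    # its component along g2 so the combined update stops being partial
    # to one modality. (Illustrative; not the paper's exact procedure.)
    if conflicts(g1, g2):
        g1 = g1 - (np.dot(g1, g2) / np.dot(g2, g2)) * g2
    return g1

# Hypothetical per-modality gradients for a shared parameter block.
g_image = np.array([1.0, 1.0])
g_text = np.array([-1.0, 0.5])

print(conflicts(g_image, g_text))  # True: dot product is -0.5
g_image_fixed = pcgrad_project(g_image, g_text)
print(np.dot(g_image_fixed, g_text))  # ~0.0: conflict removed
```

After projection, the image gradient is orthogonal to the text gradient, so applying it no longer increases the text loss to first order; applying such a rule inside each impartiality block is the kind of impartial optimization the framework targets.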