Multimodal variational autoencoders (VAEs) have shown promise as efficient generative models for weakly-supervised data. Yet, despite their advantage of weak supervision, they exhibit a gap in generative quality compared to unimodal VAEs, which are completely unsupervised. In an attempt to explain this gap, we uncover a fundamental limitation that applies to a large family of mixture-based multimodal VAEs. We prove that the sub-sampling of modalities enforces an undesirable upper bound on the multimodal ELBO and thereby limits the generative quality of the respective models. Empirically, we showcase the generative quality gap on both synthetic and real data and present the trade-offs between different variants of multimodal VAEs. We find that none of the existing approaches fulfills all desired criteria of an effective multimodal generative model when applied to datasets that are more complex than those used in previous benchmarks. In summary, we identify, formalize, and validate fundamental limitations of VAE-based approaches for modeling weakly-supervised data and discuss implications for real-world applications.
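To make the abstract's claim concrete, the following is a minimal sketch of the mixture-based objective the abstract refers to, written in standard VAE notation that we assume here (symbols $q_\phi$, $p_\theta$, $M$, and $x_m$ are not defined in the abstract itself). For modalities $X = (x_1, \dots, x_M)$, mixture-based multimodal VAEs optimize an ELBO that sub-samples modalities, averaging over unimodal encoders:

```latex
\mathcal{L}_{\text{mix}}(X)
  = \frac{1}{M} \sum_{m=1}^{M}
    \Big(
      \mathbb{E}_{q_\phi(z \mid x_m)}\!\big[\log p_\theta(X \mid z)\big]
      - \mathrm{KL}\!\big(q_\phi(z \mid x_m) \,\|\, p(z)\big)
    \Big)
```

Each unimodal posterior $q_\phi(z \mid x_m)$ must reconstruct all modalities $X$, including information that $x_m$ alone may not carry; intuitively, this is the mechanism behind the upper bound on the multimodal ELBO that the abstract describes.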