Multiple data types naturally co-occur when describing real-world phenomena, and learning from them is a long-standing goal in machine learning research. However, existing self-supervised generative models that approximate an ELBO cannot fulfill all the desired requirements of multimodal models: their posterior approximation functions lead to a trade-off between semantic coherence and the ability to learn the joint data distribution. We propose a new, generalized ELBO formulation for multimodal data that overcomes these limitations. The new objective encompasses two previous methods as special cases and combines their benefits without compromises. In extensive experiments, we demonstrate the advantage of the proposed method compared to state-of-the-art models in self-supervised, generative learning tasks.
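For orientation, a minimal sketch of the objective family referred to above; the notation and the two example posterior choices shown are illustrative assumptions and are not quoted from the abstract. For modalities \(\mathbb{X}=\{x_1,\dots,x_M\}\) with a shared latent variable \(z\), a multimodal ELBO takes the generic form
\[
\mathcal{L}(\theta,\phi;\mathbb{X}) \;=\; \mathbb{E}_{q_\phi(z\mid\mathbb{X})}\!\big[\log p_\theta(\mathbb{X}\mid z)\big] \;-\; D_{\mathrm{KL}}\!\big(q_\phi(z\mid\mathbb{X}) \,\big\|\, p_\theta(z)\big),
\]
and the methods in this family differ mainly in how the joint posterior approximation \(q_\phi(z\mid\mathbb{X})\) is assembled from unimodal encoders \(q_\phi(z\mid x_m)\), for example
\[
q_\phi(z\mid\mathbb{X}) \;\propto\; \prod_{m=1}^{M} q_\phi(z\mid x_m)
\quad\text{(product of experts)}
\qquad\text{or}\qquad
q_\phi(z\mid\mathbb{X}) \;=\; \frac{1}{M}\sum_{m=1}^{M} q_\phi(z\mid x_m)
\quad\text{(mixture of experts)}.
\]
It is this choice of posterior approximation that, per the abstract, induces the trade-off between semantic coherence and learning the joint data distribution.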