Multimodal learning for generative models often refers to the learning of abstract concepts from the commonality of information in multiple modalities, such as vision and language. While it has proven effective for learning generalisable representations, the training of such models often requires a large amount of "related" multimodal data that shares commonality, which can be expensive to come by. To mitigate this, we develop a novel contrastive framework for generative model learning, allowing us to train the model not only on the commonality between modalities, but also on the distinction between "related" and "unrelated" multimodal data. We show in experiments that our method enables data-efficient multimodal learning on challenging datasets for various multimodal VAE models. We also show that under our proposed framework, the generative model can accurately distinguish related samples from unrelated ones, making it possible to exploit the plentiful unlabeled, unpaired multimodal data.
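As a rough illustration only, and not the paper's exact objective, the sketch below shows one common way such a contrastive term could be instantiated on top of a multimodal VAE: latent codes from matched samples in a batch are treated as "related" pairs, while all other in-batch pairings serve as "unrelated" negatives. The names (contrastive_term, z_img, z_txt, the InfoNCE-style formulation, and the weight lam) are illustrative assumptions.

```python
# Illustrative sketch only: an InfoNCE-style contrastive term that scores
# "related" (paired) multimodal samples above "unrelated" (mismatched) ones.
# This is an assumed instantiation, not the paper's exact objective.
import torch
import torch.nn.functional as F

def contrastive_term(z_img: torch.Tensor, z_txt: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """z_img, z_txt: (B, D) latent codes from two modality encoders,
    where row i of each tensor comes from the same "related" pair."""
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / temperature  # (B, B) pairwise similarities
    targets = torch.arange(z_img.size(0), device=z_img.device)
    # Diagonal entries are related pairs; off-diagonal entries act as
    # unrelated negatives. Symmetric cross-entropy matches each modality
    # to its counterpart and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical usage alongside a multimodal VAE objective:
#   loss = multimodal_vae_elbo(x_img, x_txt) + lam * contrastive_term(z_img, z_txt)
```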