Contrastive loss has been increasingly used for learning representations from multiple modalities. In the limit, the nature of the contrastive loss encourages modalities to exactly match each other in the latent space. Yet it remains an open question how such modality alignment affects downstream task performance. In this paper, based on an information-theoretic argument, we first prove that exact modality alignment is in general sub-optimal for downstream prediction tasks. Hence we advocate that the key to better performance lies in meaningful latent modality structures rather than perfect modality alignment. To this end, we propose three general approaches to constructing latent modality structures. Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization. Extensive experiments are conducted on two popular multi-modal representation learning frameworks: the CLIP-based two-tower model and the ALBEF-based fusion model. We test our method on a variety of tasks, including zero-/few-shot image classification, image-text retrieval, visual question answering, visual reasoning, and visual entailment. Our method achieves consistent improvements over existing methods, demonstrating the effectiveness and generalizability of our proposed approach to latent modality structure regularization.