Pre-trained large-scale models provide transferable embeddings and show promising performance on diverse downstream tasks. However, the learned embedding space has not been analyzed in depth, and its transferability to cross-modal tasks can be improved. This paper offers a perspective for understanding multi-modal embeddings in terms of uniformity and alignment. We newly find that the representation learned by multi-modal models such as CLIP consists of two separate embedding spaces, one for each heterogeneous modality, with poor alignment between them. Moreover, large intermediate regions between the two modalities remain unexplored, yielding poor uniformity. This lack of alignment and uniformity may restrict the robustness and transferability of the representation on downstream tasks. To this end, we propose a new end-to-end fine-tuning method for robust representations that encourages better uniformity and alignment. First, we propose \textit{Geodesic Multi-Modal Mixup}, which mixes the representations of an image and a text to generate hard negative samples on the hyperspherical embedding space. Second, we fine-tune the multi-modal model with a contrastive loss over these hard negatives together with the original negative and positive samples. Through extensive experiments on retrieval, classification, and structure-awareness tasks, we demonstrate that our Geodesic Multi-Modal Mixup learns a robust representation and improves performance on various downstream tasks.
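For concreteness, one natural instantiation of such a geodesic mixup is spherical linear interpolation (slerp) along the geodesic connecting the two unit-norm embeddings; the notation below is ours and is meant as an illustrative sketch under that assumption, not the paper's exact formulation:
\[
m_\lambda(u, v) \;=\; \frac{\sin\!\big((1-\lambda)\,\theta\big)}{\sin\theta}\,u \;+\; \frac{\sin(\lambda\theta)}{\sin\theta}\,v,
\qquad \theta = \arccos\!\left(u^\top v\right),
\]
where $u$ and $v$ are L2-normalized image and text embeddings and $\lambda \in [0, 1]$. At $\lambda \in \{0, 1\}$ this recovers the original embeddings, while intermediate values of $\lambda$ produce points that stay on the unit hypersphere and lie between the image and text clusters; such mixtures can then serve as the hard negatives in the contrastive loss.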