Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text and images, supports applications ranging from AI search to content production. While separate-encoder approaches such as CLIP successfully align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. Although this unified approach is flexible and powerful, we identify that unified encoders trained with conventional contrastive learning are prone to learning modality shortcuts, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces that multimodal embeddings outperform their unimodal counterparts, while a composition regularization objective aligns each multimodal embedding with a prototype composed from its unimodal parts. These objectives explicitly model the structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as an effective principle for robust composed multimodal retrieval when using MLLMs as the unified encoder.
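The two objectives can be illustrated with a minimal PyTorch sketch. Everything below is a hypothetical reading of the abstract, not the paper's actual formulation: the margin-based form of the preference loss, the convex-combination prototype, the function names (`preference_loss`, `composition_regularizer`), and all coefficients (`margin`, `alpha`, the 0.1 weight) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the abstract's two objectives; the exact losses in the
# paper may differ. Embeddings are assumed L2-normalized, similarities are
# dot products, as is typical for contrastive retrieval setups.

def preference_loss(sim_multi, sim_uni, margin=0.1):
    """Encourage the composed (multimodal) query embedding to score higher
    against the target than a unimodal query embedding, by a margin.
    sim_multi: (B,) similarity of composed query vs. target.
    sim_uni:   (B,) similarity of a unimodal query vs. the same target."""
    return F.relu(margin - (sim_multi - sim_uni)).mean()

def composition_regularizer(z_multi, z_text, z_image, alpha=0.5):
    """Pull the composed embedding toward a prototype built from its unimodal
    parts. A convex combination is one plausible prototype; the paper's
    composition function is not specified in the abstract."""
    prototype = F.normalize(alpha * z_text + (1 - alpha) * z_image, dim=-1)
    return (1 - F.cosine_similarity(z_multi, prototype, dim=-1)).mean()

# Usage with random stand-ins for encoder outputs.
B, D = 8, 256
z_multi = F.normalize(torch.randn(B, D), dim=-1)  # unified-encoder output
z_text  = F.normalize(torch.randn(B, D), dim=-1)  # text-only embedding
z_image = F.normalize(torch.randn(B, D), dim=-1)  # image-only embedding
z_tgt   = F.normalize(torch.randn(B, D), dim=-1)  # target embeddings

sim_multi = (z_multi * z_tgt).sum(-1)
sim_text  = (z_text * z_tgt).sum(-1)

loss = preference_loss(sim_multi, sim_text) \
     + 0.1 * composition_regularizer(z_multi, z_text, z_image)
```

Both terms would be added on top of the standard contrastive objective; the preference term directly penalizes the modality-shortcut failure mode (a unimodal view matching the target as well as the composed input), while the regularizer keeps the composed representation anchored to its unimodal constituents.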