Embedding models are a cornerstone of modern AI. Driven by Multimodal Large Language Models (MLLMs), they have made great progress in architecture and data curation, yet the prevailing paradigm remains limited to SSC, i.e., single input, single embedding, contrastive supervision, which collapses rich, multifaceted inputs into monolithic embeddings and fails to fully exploit MLLM capabilities. In this paper, we tailor a Parallel Decoupling Framework (PDF) for multimodal embedding learning by exploiting the inherent steerability of MLLMs, i.e., their ability to flexibly generate markedly different responses under explicit instructions. Concretely, PDF conditions a shared MLLM backbone on distinct, learnable prefixes to roll out multiple parallel paths for a single input, then derives parallel embeddings from these paths. To promote full parallel diversity, we employ Mutual Information Minimization (MIM) as an explicit constraint, coupled with per-path contrastive supervision to maintain semantic alignment. This dual objective drives PDF toward robust semantic coverage and a generalizable embedding space. At inference, the resulting embedding space is accessible via a single forward pass, incurring negligible computational overhead. We instantiate PDF on multiple MLLM backbones and demonstrate its effectiveness on the MMEB benchmark. Significant gains are achieved consistently across resolutions and model sizes, e.g., boosting the VLM2Vec-LLaVA-1.6-LR model by a remarkable +8.9% (7B), and the VLM2Vec-Qwen2VL models by +4.2% (2B) and +3.1% (7B). In terms of efficiency, our 2B model surpasses its baseline by +2.6% using only half the computational budget.
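The core mechanism above (one shared backbone, multiple learnable prefixes yielding decoupled parallel embeddings) can be illustrated with a minimal NumPy sketch. Everything here is an assumption for exposition: the backbone is reduced to a single linear map, and the MI-minimization term is approximated by a pairwise cosine-similarity penalty rather than the paper's actual estimator.

```python
import numpy as np

# Toy sketch of the PDF idea: one shared backbone, K learnable
# prefixes -> K parallel embeddings for a single input.
# All shapes, names, and the diversity proxy are illustrative
# assumptions, not the paper's implementation.

rng = np.random.default_rng(0)
D_IN, D_EMB, K = 16, 8, 3               # input dim, embedding dim, number of paths

W = rng.normal(size=(D_IN, D_EMB))      # shared "backbone" (a single linear map here)
prefixes = rng.normal(size=(K, D_IN))   # K learnable prefixes, one per parallel path

def parallel_embed(x):
    """Return K unit-norm embeddings from one input in a single forward pass."""
    paths = x[None, :] + prefixes       # condition the shared input on each prefix
    embs = paths @ W                    # run every path through the shared backbone
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)

def diversity_penalty(embs):
    """Proxy for the MIM constraint: mean squared cosine similarity
    between distinct paths (lower = more decoupled embeddings)."""
    sims = embs @ embs.T                # pairwise cosine similarities (unit vectors)
    off_diag = sims[~np.eye(K, dtype=bool)]
    return float(np.mean(off_diag ** 2))

x = rng.normal(size=D_IN)               # one multimodal input, flattened to a vector
E = parallel_embed(x)
print(E.shape)                          # (3, 8): K parallel embeddings for one input
print(diversity_penalty(E))             # penalty to minimize alongside contrastive loss
```

In training, this penalty would be added to a per-path contrastive loss so that each path stays semantically aligned with its target while the paths remain mutually decoupled.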