Revisiting Multimodal Positional Encoding in Vision-Language Models

Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, which respectively ensure unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple, plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code will be available at https://github.com/JJJYmmm/Multimodal-RoPEs.
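
To make "frequency allocation" concrete, the following is a minimal sketch, not the authors' implementation. It assumes three position axes (temporal `t`, height `h`, width `w`) and contrasts two ways of assigning RoPE frequency channels to them: contiguous blocks, where each axis sees only one band of the spectrum, and round-robin interleaving, where each axis sees the full range. All function names are hypothetical, and reading "interleave" as round-robin assignment is an assumption made for illustration.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per pair of channels."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def chunked_allocation(num_freqs, axes=("t", "h", "w")):
    """Assign contiguous frequency blocks to each axis (illustrative)."""
    block = -(-num_freqs // len(axes))  # ceiling division
    return [axes[min(i // block, len(axes) - 1)] for i in range(num_freqs)]


def interleaved_allocation(num_freqs, axes=("t", "h", "w")):
    """Assign frequencies to axes round-robin (assumed meaning of 'interleave')."""
    return [axes[i % len(axes)] for i in range(num_freqs)]


def rotation_angles(position, allocation, freqs):
    """Rotation angle per frequency: the owning axis's position times the frequency."""
    return [position[axis] * f for axis, f in zip(allocation, freqs)]


if __name__ == "__main__":
    freqs = rope_frequencies(head_dim=12)          # 6 frequencies
    print(chunked_allocation(len(freqs)))          # axis owning each frequency
    print(interleaved_allocation(len(freqs)))
    patch = {"t": 0, "h": 2, "w": 3}               # hypothetical image-patch position
    print(rotation_angles(patch, interleaved_allocation(len(freqs)), freqs))
```

In this sketch, a token whose three position indices are all equal receives the same angles under either allocation, identical to one-dimensional RoPE at that index. This is one plausible way to read "preservation of textual priors" for text tokens.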
