Multi-modal large language models (MLLMs) have emerged as a transformative approach to aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multiple granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are commonly equipped with, e.g., CLIP and SAM, which lack alignment with language at multiple granularity levels. To address this issue, we leverage hyperbolic space, which inherently models hierarchical structure and thus provides a principled framework for bridging the granularity gap between the visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which optimizes visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with Möbius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently and clearly improves existing MLLMs in both pre-training and fine-tuning, with less than 1\% additional parameters. Code is available at https://github.com/godlin-sjtu/HyperET.
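For reference, a minimal sketch of the operation presumably underlying HyperET's learnable matrices is the standard Möbius matrix-vector multiplication on the Poincaré ball of curvature $-c$ (Ganea et al., 2018); this is an assumption based on the abstract, not the paper's exact formulation:
\[
M \otimes_c x \;=\; \frac{1}{\sqrt{c}}\,\tanh\!\left(\frac{\lVert Mx\rVert}{\lVert x\rVert}\,\tanh^{-1}\!\big(\sqrt{c}\,\lVert x\rVert\big)\right)\frac{Mx}{\lVert Mx\rVert}.
\]
Because the $\tanh$ factor rescales only the norm of $Mx$, a learnable $M$ directly controls the hyperbolic radius of the resulting embedding, which is plausibly what "dynamic hyperbolic radius adjustment" refers to; restricting $M$ to diagonal, block-diagonal, or banded form keeps the parameter count near $O(d)$, consistent with the stated sub-1\% overhead.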