Mamba Modulation: On the Length Generalization of Mamba

From arXiv. Accepted to the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025). The first two authors contributed equally.

The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N\Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
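The contrast between the accumulated-step term $\exp(-\sum_{t=1}^N\Delta_t)$ and the spectrum of $\mathbf{A}$ can be illustrated with a toy calculation. The sketch below is not the paper's method: it assumes a single scalar state with one negative eigenvalue `a`, a constant step size, and a hypothetical rule that shrinks the eigenvalue by the ratio of training length to test length. All names and numbers are illustrative assumptions.

```python
import math


def first_token_retention(a, deltas, scale=1.0):
    """Fraction of the first input still present in the state after all steps.

    Assumes a scalar recurrence h_t = exp(delta_t * scale * a) * h_{t-1} + (input term),
    so the first input is damped by exp(scale * a * sum(deltas)).
    """
    return math.exp(scale * a * math.fsum(deltas))


a = -1.0                               # hypothetical negative eigenvalue of A
delta = 0.01                           # hypothetical constant step size
train_len, test_len = 2_000, 16_000    # hypothetical context lengths

at_train = first_token_retention(a, [delta] * train_len)
at_test = first_token_retention(a, [delta] * test_len)

# Assumed scaling rule (illustration only): shrink the eigenvalue so the
# long context decays at the rate the model saw during training.
scale = train_len / test_len
at_test_scaled = first_token_retention(a, [delta] * test_len, scale=scale)

print(f"train length:          {at_train:.3e}")
print(f"test length, unscaled: {at_test:.3e}")
print(f"test length, scaled:   {at_test_scaled:.3e}")
```

In this toy decay term, scaling `a` and scaling every $\Delta_t$ by the same factor are interchangeable. In the full Mamba recurrence $\Delta_t$ also enters the discretized input term, so the two interventions are not equivalent there, which this sketch does not capture.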

