We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6x higher decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures, offering superior performance and efficiency, including on tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
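To make the contrast with Gated DeltaNet concrete, the sketch below shows what a gated delta-rule state update looks like and how a finer-grained gate could enter it. The notation ($S_t$, $k_t$, $v_t$, $q_t$, $\alpha_t$, $\beta_t$) is illustrative and assumed, not the paper's exact formulation: the first line is the standard Gated DeltaNet recurrence with a single scalar forget gate, and the second replaces that scalar with a per-channel gate vector, which is the kind of finer-grained gating the abstract refers to.

```latex
% Illustrative sketch only; symbols and exact form are assumptions, not the paper's notation.
\begin{aligned}
  % Gated DeltaNet: one scalar forget gate \alpha_t \in (0,1) decays the whole state matrix.
  S_t &= S_{t-1}\,\alpha_t\!\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top}, \\[2pt]
  % Finer-grained gating (KDA-style, assumed form): a per-channel gate vector
  % \boldsymbol{\alpha}_t, applied as a diagonal matrix, lets each memory
  % dimension of the finite-state RNN decay at its own rate.
  S_t &= S_{t-1}\,\mathrm{Diag}(\boldsymbol{\alpha}_t)\!\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top}, \\[2pt]
  % Readout keeps the usual linear-attention form.
  o_t &= S_t\, q_t .
\end{aligned}
```

Under this reading, the transition matrix applied to $S_{t-1}$ is a product of a diagonal gate and a rank-one delta-rule correction, which is why a specialized DPLR-style chunkwise algorithm can be cheaper than handling arbitrary DPLR transitions.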