We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6x higher decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures, offering superior performance and efficiency, including on tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
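To make the contrast with Gated DeltaNet concrete, the sketch below shows what a gated delta-rule state update looks like and how a finer-grained gate could enter it. The notation ($S_t$, $k_t$, $v_t$, $q_t$, $\alpha_t$, $\beta_t$) is illustrative and assumed, not the paper's exact formulation: the first line is the standard Gated DeltaNet recurrence with a single scalar forget gate, and the second replaces that scalar with a per-channel gate vector, which is the kind of finer-grained gating the abstract refers to.

```latex
% Illustrative sketch only; symbols and exact form are assumptions, not the paper's notation.
\begin{aligned}
  % Gated DeltaNet: one scalar forget gate \alpha_t \in (0,1) decays the whole state matrix.
  S_t &= S_{t-1}\,\alpha_t\!\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top}, \\[2pt]
  % Finer-grained gating (KDA-style, assumed form): a per-channel gate vector
  % \boldsymbol{\alpha}_t, applied as a diagonal matrix, lets each memory
  % dimension of the finite-state RNN decay at its own rate.
  S_t &= S_{t-1}\,\mathrm{Diag}(\boldsymbol{\alpha}_t)\!\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top}, \\[2pt]
  % Readout keeps the usual linear-attention form.
  o_t &= S_t\, q_t .
\end{aligned}
```

Under this reading, the transition matrix applied to $S_{t-1}$ is a product of a diagonal gate and a rank-one delta-rule correction, which is why a specialized DPLR-style chunkwise algorithm can be cheaper than handling arbitrary DPLR transitions.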