Mixture of Block Attention (MoBA) (Lu et al., 2025) is a promising building block for efficiently processing long contexts in LLMs by enabling queries to sparsely attend to a small subset of key-value blocks, drastically reducing computational cost. However, the design principles governing MoBA's performance are poorly understood, and it lacks an efficient GPU implementation, hindering its practical adoption. In this paper, we first develop a statistical model to analyze MoBA's underlying mechanics. Our model reveals that performance critically depends on the router's ability to accurately distinguish relevant from irrelevant blocks based on query-key affinities. We derive a signal-to-noise ratio that formally connects architectural parameters to this retrieval accuracy. Guided by our analysis, we identify two key pathways for improvement: using smaller block sizes and applying a short convolution on keys to cluster relevant signals, which enhances routing accuracy. While theoretically better, small block sizes are inefficient on GPUs. To bridge this gap, we introduce FlashMoBA, a hardware-aware CUDA kernel that enables efficient MoBA execution even with the small block sizes our theory recommends. We validate our insights by training LLMs from scratch, showing that our improved MoBA models match the performance of dense attention baselines. FlashMoBA achieves up to 14.7x speedup over FlashAttention-2 for small blocks, making our theoretically grounded improvements practical. Code is available at: https://github.com/mit-han-lab/flash-moba.
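The block-sparse routing described above can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's FlashMoBA kernel: it assumes each block is summarized by its mean key, that each query routes to its `top_k` highest-affinity blocks, and it omits details such as causal masking and always including the query's own block.

```python
import numpy as np

def moba_attention(Q, K, V, block_size, top_k):
    """Illustrative MoBA-style sparse attention (simplified sketch).

    Each query scores key-value blocks via affinity with the block's
    mean key (the router signal), then attends only within its top-k
    selected blocks. Causal masking is omitted for brevity.
    """
    T, d = Q.shape
    n_blocks = T // block_size  # assume T is divisible by block_size
    # Block representative: mean key per block, used for routing.
    K_blocks = K.reshape(n_blocks, block_size, d).mean(axis=1)  # (n_blocks, d)
    # Router affinities between every query and every block.
    scores = Q @ K_blocks.T                                     # (T, n_blocks)
    topk = np.argsort(-scores, axis=1)[:, :top_k]               # selected block ids

    out = np.zeros_like(Q)
    for i in range(T):
        # Gather key/value positions from the selected blocks only.
        idx = np.concatenate(
            [np.arange(b * block_size, (b + 1) * block_size) for b in topk[i]]
        )
        logits = Q[i] @ K[idx].T / np.sqrt(d)
        w = np.exp(logits - logits.max())
        w /= w.sum()
        out[i] = w @ V[idx]
    return out
```

The gather loop is written for clarity; the point of FlashMoBA is precisely that this gather-then-attend pattern becomes GPU-inefficient at the small block sizes the theory favors, which the hardware-aware kernel addresses.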