The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention (MQA), and grouped-query attention (GQA), among others. We further analyze MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Increasing both the head count and the per-head hidden size can therefore yield significant performance gains, provided the parameter overhead is kept small. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and a higher hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing the parameter count. Beyond the head representations, we further extend the simulation approach to the feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, which achieves significant improvements over different attention variants.
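As a rough illustration of the idea, the following is a minimal PyTorch sketch, not the authors' implementation: it widens the query/key tensors along both the head axis and the per-head feature axis with small learned projections before computing attention scores, and then collapses the simulated score maps back to the original head count with a simple linear aggregation standing in for PEAA. All names and dimensions here (SimulatedAttentionSketch, n_sim_heads, d_sim, head_up_q, head_down, and so on) are illustrative assumptions, not quantities taken from the paper.

# A minimal sketch of simulating more heads and wider q/k features, assuming
# hypothetical shapes; the actual SAS projections and PEAA aggregation may differ.
import math
import torch
import torch.nn as nn

class SimulatedAttentionSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_sim_heads=16, d_sim=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.n_sim_heads, self.d_sim = n_sim_heads, d_sim
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        # Expand head count: mix n_heads physical heads into n_sim_heads simulated heads.
        self.head_up_q = nn.Linear(n_heads, n_sim_heads, bias=False)
        self.head_up_k = nn.Linear(n_heads, n_sim_heads, bias=False)
        # Expand the per-head feature dimension of queries and keys.
        self.feat_up_q = nn.Linear(self.d_head, d_sim, bias=False)
        self.feat_up_k = nn.Linear(self.d_head, d_sim, bias=False)
        # Aggregate simulated score maps back to n_heads (stand-in for PEAA).
        self.head_down = nn.Linear(n_sim_heads, n_heads, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head)
        # Simulate more heads: mix along the head axis, then widen the q/k feature axis.
        q = self.feat_up_q(self.head_up_q(q.transpose(2, 3)).transpose(2, 3))  # (B, T, H_sim, d_sim)
        k = self.feat_up_k(self.head_up_k(k.transpose(2, 3)).transpose(2, 3))
        # Simulated attention scores over H_sim heads: (B, H_sim, T, T)
        scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / math.sqrt(self.d_sim)
        # Collapse simulated heads back to n_heads score maps, then softmax and apply to V.
        scores = self.head_down(scores.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # (B, n_heads, T, T)
        attn = scores.softmax(dim=-1)
        out = torch.einsum("bhqk,bkhd->bqhd", attn, v).reshape(B, T, -1)
        return self.o_proj(out)

x = torch.randn(2, 10, 256)
print(SimulatedAttentionSketch()(x).shape)  # torch.Size([2, 10, 256])

In this sketch the extra parameters come only from the small head- and feature-expansion matrices, which is the kind of low-overhead capacity increase the abstract describes; the exact form of the projections and of the PEAA aggregation in the paper may differ.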


