Following the success of dot-product attention in Transformers, numerous approximations have recently been proposed to address its quadratic complexity with respect to the input length. However, all approximations thus far have ignored the contribution of the $\textit{value vectors}$ to the quality of the approximation. In this work, we argue that research efforts should be directed towards approximating the true output of the attention sub-layer, which includes the value vectors. We propose a value-aware objective, and show theoretically and empirically that, in the context of language modeling, an optimal approximation under a value-aware objective substantially outperforms an optimal approximation that ignores values. Moreover, we show that the choice of kernel function for computing attention similarity can substantially affect the quality of sparse approximations, where kernel functions that are less skewed are more affected by the value vectors.
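To make the two objectives concrete, the contrast can be sketched as follows (a sketch using standard attention notation $Q$, $K$, $V$ and the Frobenius norm as an illustrative error measure; none of these symbols are fixed by the abstract itself). Let $A = \mathrm{softmax}\big(QK^\top/\sqrt{d}\big)$ denote the true attention matrix and $\hat{A}$ a sparse approximation of it. A value-oblivious objective measures error in the attention weights alone,
$$\min_{\hat{A}} \big\|A - \hat{A}\big\|_F,$$
whereas a value-aware objective measures error in the true output of the attention sub-layer, which includes the value vectors:
$$\min_{\hat{A}} \big\|AV - \hat{A}V\big\|_F.$$
Intuitively, under the value-aware objective an error on a position whose value vector has small norm contributes little to the output error, so the optima of the two objectives can differ substantially.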