The Transformer translation model is based on the multi-head attention mechanism, which can be easily parallelized. The multi-head attention network performs the scaled dot-product attention function in parallel, allowing the model to jointly attend to information from different representation subspaces at different positions. In this paper, we present an approach to learning a hard retrieval attention in which an attention head attends to only one token in the sentence rather than all tokens. The matrix multiplication between the attention probabilities and the value sequence in the standard scaled dot-product attention can thus be replaced by a simple and efficient retrieval operation. We show that, when used in the decoder self- and cross-attention networks, our hard retrieval attention mechanism decodes 1.43 times faster while preserving translation quality on a wide range of machine translation tasks.
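To make the contrast concrete, below is a minimal sketch (not the authors' implementation) written in PyTorch; the function names, tensor shapes, and the batched layout are illustrative assumptions. The standard variant multiplies the softmax attention probabilities with the value sequence, while the hard retrieval variant takes the argmax over the attention scores and gathers a single value vector per query, avoiding that matrix multiplication at inference time.

    # Sketch only: standard scaled dot-product attention vs. a hard retrieval
    # variant in which each query retrieves exactly one value vector.
    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v):
        # q, k, v: (batch, heads, seq_len, d_head)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        probs = F.softmax(scores, dim=-1)
        return probs @ v  # matrix multiplication with the value sequence

    def hard_retrieval_attention(q, k, v):
        # Same scoring, but the matmul over attention probabilities is
        # replaced by an index gather of the single highest-scoring token.
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        idx = scores.argmax(dim=-1)                        # (batch, heads, seq_len)
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, v.size(-1))
        return torch.gather(v, dim=2, index=idx)           # retrieval, no matmul

    if __name__ == "__main__":
        q = torch.randn(2, 8, 10, 64)
        k = torch.randn(2, 8, 10, 64)
        v = torch.randn(2, 8, 10, 64)
        print(scaled_dot_product_attention(q, k, v).shape)  # (2, 8, 10, 64)
        print(hard_retrieval_attention(q, k, v).shape)      # (2, 8, 10, 64)

The gather produces an output of the same shape as the standard attention, so the surrounding Transformer layers need no modification; only the way the value sequence is combined per query changes.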