Dot-product attention is a core module in the current generation of neural network models, particularly transformers, and is used across numerous areas such as natural language processing and computer vision. The attention module comprises three linear transformations, namely the query, key, and value linear transformations, each of which includes a bias term. In this work, we study the role of these bias terms and mathematically show that the bias term of the key linear transformation is redundant and can be omitted without any impact on the attention module. Moreover, we argue that the bias term of the value linear transformation plays a more prominent role than the bias term of the query linear transformation. We empirically verify these findings through multiple experiments on language modeling, natural language understanding, and natural language generation tasks.
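A minimal sketch of why the key bias is redundant, assuming the standard scaled dot-product formulation (the symbols $W_q$, $b_q$, $W_k$, $b_k$, $x_i$, and $d$ below are illustrative notation introduced here, not taken from the paper): with queries $q_i = W_q x_i + b_q$ and keys $k_j = W_k x_j + b_k$, the pre-softmax score is
\[
q_i^\top k_j \;=\; q_i^\top W_k x_j \;+\; q_i^\top b_k .
\]
The term $q_i^\top b_k$ does not depend on the key index $j$, so it shifts all scores in row $i$ by the same constant. Since the softmax over $j$ is invariant to adding a constant to all of its inputs,
\[
\operatorname{softmax}_j\!\left(\frac{q_i^\top W_k x_j + q_i^\top b_k}{\sqrt{d}}\right)
\;=\;
\operatorname{softmax}_j\!\left(\frac{q_i^\top W_k x_j}{\sqrt{d}}\right),
\]
so the attention weights, and hence the module's output, are unchanged when $b_k$ is dropped.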