We introduce "talking-heads attention" - a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additional computation, talking-heads attention leads to better perplexities on masked language modeling tasks, as well as better quality when transfer-learning to language comprehension and question answering tasks.
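As a rough illustration of the mechanism described above, the sketch below (plain NumPy, with hypothetical shapes and parameter names such as `P_logits` and `P_weights`) applies one learned linear projection across the heads dimension to the attention logits before the softmax, and a second one to the attention weights after it; it is a minimal sketch under these assumptions, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def talking_heads_attention(Q, K, V, P_logits, P_weights):
    """Sketch of talking-heads attention.

    Q: [batch, h_k, n, d_k]   query vectors, h_k "logit" heads
    K: [batch, h_k, m, d_k]   key vectors
    V: [batch, h_v, m, d_v]   value vectors, h_v "value" heads
    P_logits:  [h_k, h]  projection across heads, applied before the softmax
    P_weights: [h, h_v]  projection across heads, applied after the softmax
    """
    # Standard dot-product attention logits per head: [batch, h_k, n, m]
    logits = np.einsum('bhnk,bhmk->bhnm', Q, K)
    # Mix information across the heads dimension before the softmax
    logits = np.einsum('bhnm,hs->bsnm', logits, P_logits)
    # Softmax over the key positions
    weights = softmax(logits, axis=-1)
    # Mix across the heads dimension again after the softmax
    weights = np.einsum('bsnm,sv->bvnm', weights, P_weights)
    # Weighted sum of values: [batch, h_v, n, d_v]
    return np.einsum('bvnm,bvmd->bvnd', weights, V)
```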