Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work adapting the Transformer's normalization to this setting, we propose QKNorm, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity. Specifically, we apply $\ell_2$ normalization along the head dimension of each query and key matrix prior to multiplying them and then scale up by a learnable parameter instead of dividing by the square root of the embedding dimension. We show improvements averaging 0.928 BLEU over state-of-the-art bilingual benchmarks for 5 low-resource translation pairs from the TED Talks corpus and IWSLT'15.
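To make the described attention modification concrete, here is a minimal sketch, assuming PyTorch; the function name `qk_norm_attention` and the initial value of the learnable scale are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, gamma):
    """Dot-product attention with query-key normalization.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    gamma: learnable scalar that replaces division by sqrt(head_dim).
    """
    # l2-normalize queries and keys along the head (feature) dimension,
    # so each query-key dot product becomes a cosine similarity in [-1, 1]
    # and the softmax cannot saturate arbitrarily.
    q = F.normalize(q, p=2, dim=-1)
    k = F.normalize(k, p=2, dim=-1)

    # Scale up by the learnable parameter instead of dividing by sqrt(d).
    scores = gamma * torch.matmul(q, k.transpose(-2, -1))
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Illustrative usage; the initial scale value is an assumption.
batch, heads, seq_len, head_dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)
gamma = torch.nn.Parameter(torch.tensor(10.0))
out = qk_norm_attention(q, k, v, gamma)
```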