The impressive performance of the Transformer has been attributed to self-attention, which considers dependencies on the entire input sequence at every position. In this work, we reform the neural $n$-gram model, which focuses only on a few surrounding representations of each position, with the multi-head mechanism as in Vaswani et al. (2017). Through experiments on sequence-to-sequence tasks, we show that replacing self-attention in the Transformer with the multi-head neural $n$-gram achieves performance comparable to or better than that of the Transformer. From various analyses of our proposed method, we find that the multi-head neural $n$-gram is complementary to self-attention, and combining them can further improve the performance of the vanilla Transformer.
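As a rough illustration of the idea, the sketch below combines, at each position, only the current hidden state and its few preceding ones, split across heads, instead of attending over the whole sequence. The module name, the left-padded sliding-window construction, and the ReLU and projection choices are assumptions made for illustration, not the exact formulation used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadNeuralNgram(nn.Module):
    """Sketch of a multi-head neural n-gram layer (illustrative, not the paper's exact design).

    Each position is computed from the current token representation and its
    (ngram - 1) preceding representations only, with the mixing done per head.
    """

    def __init__(self, d_model: int, n_heads: int, ngram: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.ngram = ngram
        # Head-wise projection over the concatenated local window (assumed form).
        self.window_proj = nn.Linear(self.d_head * ngram, self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        h = x.view(b, t, self.n_heads, self.d_head)        # split into heads
        # Left-pad the time axis so position i sees positions i-(ngram-1) .. i.
        h = F.pad(h, (0, 0, 0, 0, self.ngram - 1, 0))      # (b, t+ngram-1, heads, d_head)
        # Sliding windows of length `ngram` for every position.
        windows = h.unfold(dimension=1, size=self.ngram, step=1)   # (b, t, heads, d_head, ngram)
        windows = windows.permute(0, 1, 2, 4, 3).reshape(b, t, self.n_heads, -1)
        mixed = torch.relu(self.window_proj(windows))       # (b, t, heads, d_head)
        return self.out_proj(mixed.reshape(b, t, d))


# Usage: drop-in replacement for a self-attention sublayer (shapes only).
layer = MultiHeadNeuralNgram(d_model=512, n_heads=8, ngram=4)
y = layer(torch.randn(2, 10, 512))   # -> (2, 10, 512)
```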