The Transformer architecture, built by stacking encoder and decoder layers, has driven significant progress in neural machine translation. However, the vanilla Transformer mainly exploits the top-layer representation, assuming that the lower layers provide trivial or redundant information, and thus ignores potentially valuable bottom-layer features. In this work, we propose the Group-Transformer model (GTrans), which flexibly divides the multi-layer representations of both the encoder and the decoder into groups and then fuses these group features to generate target words. To corroborate the effectiveness of the proposed method, extensive experiments and analyses are conducted on three bilingual translation benchmarks and two multilingual translation tasks, including the IWSLT-14, IWSLT-17, LDC, WMT-14, and OPUS-100 benchmarks. The results show that our model consistently outperforms its Transformer counterparts. Furthermore, it can be successfully scaled up to 60 encoder layers and 36 decoder layers.
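The group-then-fuse idea can be illustrated with a minimal sketch. The abstract does not specify how groups are formed or how they are fused, so everything concrete here is an assumption made for illustration: layers are split into contiguous, near-equal groups, each group is summarized by the mean of its layer outputs, and the group features are combined with softmax-normalized weights. The names `split_into_groups`, `fuse_groups`, and `group_scores` are hypothetical and are not taken from the GTrans implementation.

```python
import math
from typing import List

Vector = List[float]


def split_into_groups(layers: List[Vector], num_groups: int) -> List[List[Vector]]:
    """Split layer outputs (ordered bottom to top) into contiguous groups.

    Assumption: groups are contiguous and as equal in size as possible;
    any remainder goes to the lowest groups.
    """
    if not 1 <= num_groups <= len(layers):
        raise ValueError("num_groups must be between 1 and the number of layers")
    base, extra = divmod(len(layers), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)
        groups.append(layers[start:start + size])
        start += size
    return groups


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_groups(layers: List[Vector], num_groups: int,
                group_scores: List[float]) -> Vector:
    """Fuse per-group features into a single representation.

    Assumption: each group feature is the mean of its layers, and fusion is
    a softmax-weighted sum. In a trained model, group_scores would be
    learned parameters; here they are plain inputs.
    """
    if len(group_scores) != num_groups:
        raise ValueError("need exactly one score per group")
    features = [mean_vector(g) for g in split_into_groups(layers, num_groups)]
    weights = softmax(group_scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


if __name__ == "__main__":
    # Six toy "layer outputs" for a single token, each a 2-dimensional vector.
    toy_layers = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0],
                  [7.0, 0.0], [9.0, 0.0], [11.0, 0.0]]
    print(fuse_groups(toy_layers, num_groups=3, group_scores=[0.0, 0.0, 0.0]))
```

With equal scores, every group contributes equally, so lower-layer features influence the result as much as the top ones. Pushing all weight onto the last group would approximate the top-layer-only behavior that the abstract attributes to the vanilla Transformer.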