The large attention-based encoder-decoder network (Transformer) has recently become prevalent due to its effectiveness, but the high computational complexity of its decoder raises an efficiency issue. By examining the mathematical formulation of the decoder, we show that under some mild conditions the architecture can be simplified by compressing its sub-layers, the basic building blocks of the Transformer, achieving higher parallelism. We thereby propose the Compressed Attention Network, whose decoder layer consists of only one sub-layer instead of three. Extensive experiments on 14 WMT machine translation tasks show that our model is 1.42x faster, with performance on par with a strong baseline that is itself already 2x faster than the widely used standard baseline without loss in performance.
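The abstract does not spell out how the three standard decoder sub-layers (masked self-attention, encoder-decoder attention, feed-forward) are compressed into one. The sketch below is only a minimal, hypothetical PyTorch illustration of the general idea of a single-sub-layer decoder block, assuming one attention call that attends jointly over the decoder states and the encoder output; the class name CompressedDecoderLayer, the concatenation of target and memory, and the omission of the feed-forward transform and causal mask are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn


class CompressedDecoderLayer(nn.Module):
    """Hypothetical single-sub-layer decoder block: one attention call over
    the concatenation of the target states and the encoder memory stands in
    for the usual self-attention + cross-attention + feed-forward stack.
    Causal masking over the target portion is omitted for brevity."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tgt: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Keys/values come from both the decoder input and the encoder output,
        # so a single attention replaces two separate attention sub-layers.
        kv = torch.cat([tgt, memory], dim=1)
        out, _ = self.attn(tgt, kv, kv)
        return self.norm(tgt + out)


# Shape check only: batch of 2, target length 10, source length 20.
layer = CompressedDecoderLayer()
y = layer(torch.randn(2, 10, 512), torch.randn(2, 20, 512))
print(y.shape)  # torch.Size([2, 10, 512])
```

Collapsing the per-layer stack this way is what reduces the sequential work per decoder layer and allows the higher parallelism claimed above.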