The advent of the Transformer can arguably be described as a driving force behind many of the recent advances in natural language processing. However, despite its sizeable performance improvements, the model has recently been shown to be severely over-parameterized, making it parameter-inefficient and computationally expensive to train. Inspired by the success of parameter-sharing in pretrained deep contextualized word representation encoders, we explore parameter-sharing methods in Transformers, with a specific focus on encoder-decoder models for sequence-to-sequence tasks such as neural machine translation. We analyze different parameter sharing/reduction methods and develop the Subformer, a parameter-efficient Transformer-based model that combines the newly proposed Sandwich-style parameter-sharing technique, designed to overcome the deficiencies of naive cross-layer parameter sharing in generative models, with self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
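To make the two named components more concrete, the following is a minimal sketch, not the authors' implementation, under two assumptions: "Sandwich-style" sharing is read as keeping the first and last layers distinct while the central layers reuse a single set of weights, and embedding factorization is read as a small embedding table projected up to the model dimension (the paper's SAFE variant uses self-attention in that projection, which is simplified to a linear layer here). All class and argument names are hypothetical.

```python
import torch
import torch.nn as nn


def make_layer(d_model, nhead):
    # Standard Transformer encoder layer used as the building block.
    return nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)


class SandwichSharedEncoder(nn.Module):
    """Unique bottom and top layers; one shared parameter set reused for all middle layers."""

    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.first = make_layer(d_model, nhead)   # unique bottom layer
        self.shared = make_layer(d_model, nhead)  # single parameter set for the middle
        self.last = make_layer(d_model, nhead)    # unique top layer
        self.num_inner = max(num_layers - 2, 0)

    def forward(self, x):
        x = self.first(x)
        for _ in range(self.num_inner):  # weight tying: the same module is applied repeatedly
            x = self.shared(x)
        return self.last(x)


class FactorizedEmbedding(nn.Module):
    """Small |V| x d_embed table plus a learned projection, instead of a full |V| x d_model table."""

    def __init__(self, vocab_size=32000, d_embed=128, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.project = nn.Linear(d_embed, d_model, bias=False)

    def forward(self, token_ids):
        return self.project(self.embed(token_ids))


if __name__ == "__main__":
    tokens = torch.randint(0, 32000, (2, 16))
    hidden = FactorizedEmbedding()(tokens)     # (2, 16, 512)
    out = SandwichSharedEncoder()(hidden)
    print(out.shape)                           # torch.Size([2, 16, 512])
```

The parameter savings come from two places: the middle encoder layers contribute only one layer's worth of weights regardless of depth, and the embedding table stores vocab_size × d_embed values plus a small projection rather than vocab_size × d_model.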