Transformers have shown improved performance compared to previous architectures for sequence processing such as RNNs. Despite these sizeable performance gains, as recent work has suggested, the model is computationally expensive to train and carries a high parameter budget. In light of this, we explore parameter-sharing methods in Transformers, with a specific focus on generative models. We perform an analysis of different parameter sharing/reduction methods and develop the Subformer. Our model combines sandwich-style parameter sharing, which overcomes the shortcomings of naive cross-layer parameter sharing in generative models, with self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
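As a rough illustration of the two components named above, the sketch below shows a minimal PyTorch rendering of sandwich-style parameter sharing (independent first and last layers, a single shared set of weights reused for all middle layers) together with a simplified factorized embedding. The class names, layer counts, and dimensions are illustrative assumptions rather than the paper's actual configuration, and the projection shown is a plain linear map, whereas the full SAFE module uses a self-attention-based projection.

```python
import torch
import torch.nn as nn


class SandwichSharedEncoder(nn.Module):
    """Sketch of sandwich-style parameter sharing: the outer (first and last)
    layers keep their own parameters, while every middle layer reuses one
    shared layer instance, so its weights are counted only once."""

    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        first, last = make(), make()           # independent outer layers
        shared = make()                        # one weight set for the middle
        middle = [shared] * (num_layers - 2)   # same instance repeated
        self.layers = nn.ModuleList([first, *middle, last])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class FactorizedEmbedding(nn.Module):
    """Simplified embedding factorization: tokens are embedded in a small
    dimension and projected up to the model dimension, reducing the embedding
    parameter count from V*d_model to V*d_embed + d_embed*d_model."""

    def __init__(self, vocab_size=32000, d_embed=128, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.proj = nn.Linear(d_embed, d_model)

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids))


if __name__ == "__main__":
    tokens = torch.randint(0, 32000, (2, 10))          # (batch, sequence)
    hidden = FactorizedEmbedding()(tokens)              # (2, 10, 512)
    output = SandwichSharedEncoder()(hidden)            # (2, 10, 512)
    print(output.shape)
```

This is a sketch under the stated assumptions, not the Subformer implementation; it only illustrates how cross-layer sharing and embedding factorization reduce the parameter budget.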