Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. But this comes with a significant computational cost, as the attention mechanism's complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent work. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while maintaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then finetune. With a learned feature map, our approach provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants. We also show that the finetuning process has lower training cost relative to training these recurrent variants from scratch. As many models for natural language tasks are increasingly dependent on large-scale pretrained transformers, this work presents a viable approach to improving inference efficiency without repeating the expensive pretraining process.
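The linear-complexity recurrent alternative mentioned above replaces the softmax attention with a kernelized form whose cost grows linearly in sequence length. The sketch below illustrates the general idea under simple assumptions: the learned feature map is taken to be a single linear projection followed by ReLU, and the non-causal formulation is shown for brevity. The module name, tensor shapes, and the `feature_dim` hyperparameter are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class LinearAttention(nn.Module):
    """Kernelized (linear-complexity) attention sketch.

    softmax(Q K^T) V is approximated by phi(Q) (phi(K)^T V), normalized by
    phi(Q) (phi(K)^T 1), so the cost is linear rather than quadratic in the
    sequence length. The feature map phi is learned; Linear + ReLU here is an
    assumed, illustrative parameterization.
    """

    def __init__(self, head_dim: int, feature_dim: int):
        super().__init__()
        # Learned feature map applied to queries and keys, per attention head.
        self.feature_map = nn.Sequential(nn.Linear(head_dim, feature_dim), nn.ReLU())

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, heads, seq_len, head_dim)
        phi_q = self.feature_map(q)  # (B, H, T, F)
        phi_k = self.feature_map(k)  # (B, H, T, F)

        # Summaries over the sequence dimension: computed once, O(T) not O(T^2).
        kv = torch.einsum("bhtf,bhtd->bhfd", phi_k, v)  # (B, H, F, D)
        k_sum = phi_k.sum(dim=2)                        # (B, H, F)

        # Attention output: numerator phi(Q)(phi(K)^T V), denominator phi(Q)(phi(K)^T 1).
        num = torch.einsum("bhtf,bhfd->bhtd", phi_q, kv)              # (B, H, T, D)
        den = torch.einsum("bhtf,bhf->bht", phi_q, k_sum).clamp(min=1e-6)
        return num / den.unsqueeze(-1)
```

For autoregressive generation, the key-value summaries above become running sums that are updated one token at a time, which is what makes this attention variant recurrent and allows decoding with constant memory per step. In the swap-then-finetune setting, a module of this kind would be substituted for the pretrained softmax attention and the model finetuned so the learned feature map adapts to the existing weights.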