Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. This comes at a significant computational cost, as the attention mechanism's complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent work. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while retaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then finetune. With a learned feature map, our approach provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants. We also show that the finetuning process requires a lower training cost than training these recurrent variants from scratch. As many recent models for natural language tasks increasingly depend on large-scale pretrained transformers, this work presents a viable approach to improving inference efficiency without repeating the expensive pretraining process.
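To make the "linear-complexity recurrent alternative" concrete, the sketch below contrasts standard causal softmax attention with a recurrent linear attention that uses a small learned feature map. This is a minimal single-head illustration under assumed shapes; the names (learned_feature_map, W, b, feature_dim) are hypothetical and it is not the paper's exact parameterization.

```python
# Minimal sketch (not the paper's exact formulation): causal softmax attention
# versus a linear-complexity recurrent attention with a learned feature map phi.
import numpy as np

def softmax_attention(Q, K, V):
    """Standard causal softmax attention: O(n^2) in sequence length n."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def learned_feature_map(x, W, b):
    """A small learned feature map phi(x) standing in for the softmax kernel."""
    return np.maximum(x @ W + b, 0.0) + 1e-6  # keep features positive/nonzero

def recurrent_linear_attention(Q, K, V, W, b):
    """Causal linear attention computed as an RNN: O(n) in sequence length.

    Maintains a running sum S of phi(k_t) v_t^T and a normalizer z, so each
    generation step costs O(feature_dim * d) regardless of prefix length.
    """
    n, d = Q.shape
    f = W.shape[1]
    S = np.zeros((f, d))   # recurrent "memory" state
    z = np.zeros(f)        # recurrent normalizer state
    out = np.zeros((n, d))
    for t in range(n):
        phi_k = learned_feature_map(K[t], W, b)
        phi_q = learned_feature_map(Q[t], W, b)
        S += np.outer(phi_k, V[t])
        z += phi_k
        out[t] = (phi_q @ S) / (phi_q @ z)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, f = 8, 16, 32          # sequence length, head dim, feature dim
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    W, b = rng.standard_normal((d, f)) * 0.1, np.zeros(f)
    print("softmax attention output:", softmax_attention(Q, K, V).shape)
    print("recurrent linear attention output:", recurrent_linear_attention(Q, K, V, W, b).shape)
```

In the swap-then-finetune view, the softmax_attention call in a pretrained layer is replaced by recurrent_linear_attention, and the feature-map parameters (here W, b) are learned during finetuning rather than fixed by a randomized or heuristic construction.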