Recurrent neural networks are effective models for processing sequences. However, they struggle to learn long-term dependencies because of their inherently sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model based solely on the attention mechanism that is able to relate any two positions of the input sequence, hence modelling arbitrarily long dependencies. The Transformer has improved the state of the art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the efficiency of its models, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision training, and knowledge distillation. Recently, researchers have directly addressed the Transformer's limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice in order to meet the desired trade-off between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions.
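The quadratic cost mentioned above comes from the pairwise attention score matrix. The sketch below makes the counting explicit; the symbols n (sequence length) and d (head dimension) are introduced here for illustration only and are not part of the abstract itself.

% Minimal sketch, under standard assumptions, of why full self-attention scales
% quadratically with the sequence length n (d denotes the head dimension).
\[
  \mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
  \qquad Q, K, V \in \mathbb{R}^{n \times d},
\]
% so computing and storing the n-by-n score matrix Q K^T requires O(n^2 d) time
% and O(n^2) memory, which dominates for long sequences.

The lower-complexity alternatives cited in the abstract (Longformer, Reformer, Linformer, Performer) all target this score matrix, either by sparsifying it or by approximating it without materializing it in full.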