Recently, numerous efficient Transformers have been proposed to reduce the quadratic computational complexity of the standard Transformer caused by Softmax attention. However, most of them simply replace Softmax attention with an efficient attention mechanism, without considering architectures customized for that efficient attention. In this paper, we argue that the vanilla Transformer architectures handcrafted for Softmax attention may not be suitable for efficient Transformers. To address this issue, we propose a new framework that finds optimal architectures for efficient Transformers with the neural architecture search (NAS) technique. The proposed method is validated on popular machine translation and image classification tasks. We observe that the optimal architecture of the efficient Transformer requires less computation than that of the standard Transformer, but its overall accuracy is lower. This indicates that Softmax attention and efficient attention each have their own strengths, but neither can balance accuracy and efficiency well on its own, which motivates us to mix the two types of attention to reduce the performance imbalance. Besides the search spaces commonly used in existing NAS Transformer approaches, we propose a new search space that allows the NAS algorithm to automatically search attention variants along with architectures. Extensive experiments on WMT'14 En-De and CIFAR-10 demonstrate that our searched architecture maintains comparable accuracy to the standard Transformer with notably improved computational efficiency.
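To make the idea of searching attention variants alongside architectures concrete, the following is a minimal, hypothetical sketch in which each layer's attention type (Softmax vs. an efficient linear variant) is treated as one more searchable dimension next to standard Transformer hyperparameters. All names here (`sample_architecture`, the candidate dimension lists) are illustrative assumptions, not the paper's actual search space or implementation.

```python
# Hypothetical sketch of a mixed-attention NAS search space: the attention
# type of every layer is searched jointly with the usual architectural
# choices, so candidates may freely mix Softmax and efficient attention.
import random

ATTENTION_TYPES = ["softmax", "linear"]  # Softmax vs. an efficient variant (assumed candidates)
HIDDEN_DIMS = [256, 512]                 # illustrative architectural dimensions
NUM_HEADS = [4, 8]

def sample_architecture(num_layers, seed=None):
    """Draw one candidate architecture: a per-layer attention type plus
    standard Transformer hyperparameters (random sampling stands in for
    whatever NAS algorithm actually drives the search)."""
    rng = random.Random(seed)
    return [
        {
            "attention": rng.choice(ATTENTION_TYPES),
            "hidden_dim": rng.choice(HIDDEN_DIMS),
            "num_heads": rng.choice(NUM_HEADS),
        }
        for _ in range(num_layers)
    ]

if __name__ == "__main__":
    # One sampled candidate: layers may use Softmax attention where accuracy
    # matters most and linear attention elsewhere for efficiency.
    for i, layer in enumerate(sample_architecture(num_layers=6, seed=0)):
        print(f"layer {i}: {layer}")
```

Under this framing, mixing the two attention types is not a separate mechanism but simply an outcome the search can reach, since pure-Softmax and pure-efficient models are both points in the same space.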