Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms state-of-the-art models on multiple tasks in the language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of previous methods, while being faster and able to handle sequences 3$\times$ as long as its full-attention counterpart on the same hardware. On ImageNet, it obtains state-of-the-art results~(e.g., 84.1% Top-1 accuracy when trained on 224$\times$224 ImageNet-1K only), while being more scalable on high-resolution images. The models and source code will be released soon.
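To make the mechanism described above concrete, the following is a minimal PyTorch sketch of how the two attention branches could be combined. All module and parameter names (LongShortAttentionSketch, proj_len, window, etc.) are illustrative assumptions rather than the paper's reference implementation: the short-term branch is simplified to non-overlapping blocks instead of a sliding window, only the bidirectional (non-causal) case is handled, and the dual normalization is realized as one LayerNorm per key/value branch.

```python
# Sketch only: dynamic low-rank projection for long-range attention combined
# with block-wise local attention; names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongShortAttentionSketch(nn.Module):
    """Bidirectional long-short attention: each query attends to its local
    block of tokens plus a small set of dynamically projected global tokens."""

    def __init__(self, dim, num_heads=8, proj_len=64, window=128):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.r, self.w = proj_len, window
        self.qkv = nn.Linear(dim, 3 * dim)
        # dynamic projection: a data-dependent map from n keys to r "global" tokens
        self.to_proj = nn.Linear(self.d, proj_len)
        # dual normalization: separate LayerNorms for the two key/value branches
        self.ln_local = nn.LayerNorm(self.d)
        self.ln_global = nn.LayerNorm(self.d)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        B, n, dim = x.shape
        assert n % self.w == 0, "sketch assumes the length is a multiple of the window"
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, heads, n, head_dim)
        q, k, v = [t.reshape(B, n, self.h, self.d).transpose(1, 2) for t in (q, k, v)]

        # long-range branch: compress K/V along the length axis with a
        # data-dependent projection P of shape (B, h, n, r)
        p = F.softmax(self.to_proj(k), dim=-2)
        k_glob = self.ln_global(p.transpose(-2, -1) @ k)   # (B, h, r, d)
        v_glob = self.ln_global(p.transpose(-2, -1) @ v)   # (B, h, r, d)

        # short-term branch: split the sequence into non-overlapping blocks
        nb = n // self.w
        q_blk = q.reshape(B, self.h, nb, self.w, self.d)
        k_loc = self.ln_local(k).reshape(B, self.h, nb, self.w, self.d)
        v_loc = self.ln_local(v).reshape(B, self.h, nb, self.w, self.d)

        # each block attends to its own w tokens plus the r global tokens
        glob_shape = (B, self.h, nb, self.r, self.d)
        k_cat = torch.cat([k_loc, k_glob.unsqueeze(2).expand(glob_shape)], dim=3)
        v_cat = torch.cat([v_loc, v_glob.unsqueeze(2).expand(glob_shape)], dim=3)
        attn = F.softmax(q_blk @ k_cat.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = (attn @ v_cat).reshape(B, self.h, n, self.d).transpose(1, 2)
        return self.out(out.reshape(B, n, dim))
```

In this simplified form, each of the n queries attends to w local tokens and r projected global tokens per head, so the cost scales as O(n(w + r)) rather than O(n^2); the method described in the abstract additionally supports causal masking for autoregressive models and a sliding, rather than block-wise, local window.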