Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms state-of-the-art models on multiple tasks in the language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of previous methods, while being faster and able to handle sequences 3$\times$ as long as its full-attention counterpart on the same hardware. On ImageNet, it obtains state-of-the-art results~(e.g., 84.1% Top-1 accuracy when trained on 224$\times$224 ImageNet-1K only), while being more scalable on high-resolution images. The models and source code will be released soon.
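To make the mechanism described above concrete, the following is a minimal PyTorch sketch of how the two attention branches could be combined. All module and parameter names (LongShortAttentionSketch, proj_len, window, etc.) are illustrative assumptions rather than the paper's reference implementation: the short-term branch is simplified to non-overlapping blocks instead of a sliding window, only the bidirectional (non-causal) case is handled, and the dual normalization is realized as one LayerNorm per key/value branch.

```python
# Sketch only: dynamic low-rank projection for long-range attention combined
# with block-wise local attention; names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongShortAttentionSketch(nn.Module):
    """Bidirectional long-short attention: each query attends to its local
    block of tokens plus a small set of dynamically projected global tokens."""

    def __init__(self, dim, num_heads=8, proj_len=64, window=128):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.r, self.w = proj_len, window
        self.qkv = nn.Linear(dim, 3 * dim)
        # dynamic projection: a data-dependent map from n keys to r "global" tokens
        self.to_proj = nn.Linear(self.d, proj_len)
        # dual normalization: separate LayerNorms for the two key/value branches
        self.ln_local = nn.LayerNorm(self.d)
        self.ln_global = nn.LayerNorm(self.d)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        B, n, dim = x.shape
        assert n % self.w == 0, "sketch assumes the length is a multiple of the window"
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, heads, n, head_dim)
        q, k, v = [t.reshape(B, n, self.h, self.d).transpose(1, 2) for t in (q, k, v)]

        # long-range branch: compress K/V along the length axis with a
        # data-dependent projection P of shape (B, h, n, r)
        p = F.softmax(self.to_proj(k), dim=-2)
        k_glob = self.ln_global(p.transpose(-2, -1) @ k)   # (B, h, r, d)
        v_glob = self.ln_global(p.transpose(-2, -1) @ v)   # (B, h, r, d)

        # short-term branch: split the sequence into non-overlapping blocks
        nb = n // self.w
        q_blk = q.reshape(B, self.h, nb, self.w, self.d)
        k_loc = self.ln_local(k).reshape(B, self.h, nb, self.w, self.d)
        v_loc = self.ln_local(v).reshape(B, self.h, nb, self.w, self.d)

        # each block attends to its own w tokens plus the r global tokens
        glob_shape = (B, self.h, nb, self.r, self.d)
        k_cat = torch.cat([k_loc, k_glob.unsqueeze(2).expand(glob_shape)], dim=3)
        v_cat = torch.cat([v_loc, v_glob.unsqueeze(2).expand(glob_shape)], dim=3)
        attn = F.softmax(q_blk @ k_cat.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = (attn @ v_cat).reshape(B, self.h, n, self.d).transpose(1, 2)
        return self.out(out.reshape(B, n, dim))
```

In this simplified form, each of the n queries attends to w local tokens and r projected global tokens per head, so the cost scales as O(n(w + r)) rather than O(n^2); the method described in the abstract additionally supports causal masking for autoregressive models and a sliding, rather than block-wise, local window.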