Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over long sequences of representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the observation that the early self-attention layers in recent hierarchical vision Transformers still focus on local patterns and bring only minor benefits. Specifically, we propose a hierarchical Transformer that uses pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages, while applying self-attention modules to capture longer-range dependencies in the deeper layers. Moreover, we propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection, and instance segmentation, serving as a strong backbone for many vision tasks. Code is available at: https://github.com/zhuang-group/LIT
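To make the stage design above concrete, the following PyTorch sketch shows the layout the abstract describes: pure MLP blocks (no self-attention) in the early stages and standard multi-head self-attention blocks in the deeper stages. The class names, channel dimensions, and depths here are illustrative assumptions, not the authors' implementation; see the repository linked above for the official code.

```python
import torch
import torch.nn as nn


class MLPBlock(nn.Module):
    """Early-stage block: LayerNorm + residual MLP only, no self-attention."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                  # x: (B, N, C)
        return x + self.mlp(self.norm(x))


class AttentionBlock(nn.Module):
    """Deep-stage block: multi-head self-attention followed by an MLP."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual attention
        return x + self.mlp(self.norm2(x))                 # residual MLP


# Illustrative four-stage layout: the first two stages use only MLP blocks,
# the last two use self-attention blocks (dims/depths are placeholders).
def build_stages(dims=(64, 128, 256, 512), depths=(2, 2, 6, 2)):
    stages = []
    for i, (dim, depth) in enumerate(zip(dims, depths)):
        block = MLPBlock if i < 2 else AttentionBlock
        stages.append(nn.Sequential(*[block(dim) for _ in range(depth)]))
    return nn.ModuleList(stages)
```

Since the early MLP blocks never form an attention matrix, the quadratic cost of self-attention is paid only in the deeper stages, where the token sequence has already been downsampled.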
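The deformable token merging module can likewise be sketched, under the assumption that it behaves like a strided deformable convolution whose per-location sampling offsets are predicted from the input feature map and initialized to zero, so that merging starts from the regular uniform grid and learns to deviate from it. `DeformableTokenMerging` and its parameters are hypothetical names for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class DeformableTokenMerging(nn.Module):
    """Merge a 2x2 neighborhood of tokens with learned, non-uniform offsets."""
    def __init__(self, in_dim, out_dim, kernel_size=2, stride=2):
        super().__init__()
        self.stride = stride
        # Weight of the merging convolution (2x2, stride 2 halves H and W).
        self.weight = nn.Parameter(
            torch.randn(out_dim, in_dim, kernel_size, kernel_size) * 0.02
        )
        # Predict a 2-D offset for each of the k*k sampling locations;
        # zero-initialized so the initial behavior is uniform patch merging.
        self.offset_pred = nn.Conv2d(
            in_dim, 2 * kernel_size * kernel_size,
            kernel_size=kernel_size, stride=stride
        )
        nn.init.zeros_(self.offset_pred.weight)
        nn.init.zeros_(self.offset_pred.bias)

    def forward(self, x):                      # x: (B, C, H, W)
        offsets = self.offset_pred(x)          # (B, 2*k*k, H/2, W/2)
        return deform_conv2d(x, offsets, self.weight, stride=self.stride)


x = torch.randn(1, 64, 56, 56)
merged = DeformableTokenMerging(64, 128)(x)
print(merged.shape)  # torch.Size([1, 128, 28, 28])
```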