Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective

Visual representation learning is key to solving a wide range of vision problems. Building on grid-structure priors, convolutional neural networks (CNNs) have been the de facto standard architecture for most deep vision models. For instance, classical semantic segmentation methods often adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract visual concepts with larger receptive fields. Since context modeling is critical for segmentation, recent efforts have focused on enlarging the receptive field, either through dilated (i.e., atrous) convolutions or by inserting attention modules. However, the underlying FCN-based architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating visual representation learning generally as a sequence-to-sequence prediction task. Specifically, we deploy a pure Transformer to encode an image as a sequence of patches, without local convolution or resolution reduction. With global context modeled in every layer of the Transformer, stronger visual representations can be learned to better tackle vision tasks. In particular, our segmentation model, termed SEgmentation TRansformer (SETR), achieves 50.28% mIoU on ADE20K (first place on the test leaderboard on the day of submission) and 55.83% mIoU on Pascal Context, and reaches competitive results on Cityscapes. Further, we formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a hierarchical, pyramidal architecture. Extensive experiments show that our method achieves appealing performance on a variety of visual recognition tasks (e.g., image classification, object detection, instance segmentation, and semantic segmentation).
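The sequence-to-sequence view described in the abstract can be illustrated with a minimal sketch. It is not the SETR implementation: the function names `image_to_patch_sequence` and `global_self_attention`, the toy 4 x 4 image, and the patch size of 2 are all hypothetical choices made for illustration, and the attention step uses identity query/key/value projections and a single head as simplifying assumptions (a real Transformer layer would use learned projections, position embeddings, and multiple heads). The sketch only shows two points from the text: an image becomes a sequence of flattened patches, and every token attends to every other token while the sequence length stays fixed.

```python
import math


def image_to_patch_sequence(image, patch):
    """Split an H x W x C image (nested lists) into a row-major
    sequence of flattened, non-overlapping patch x patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            sequence.append(token)
    return sequence


def global_self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption for brevity: queries, keys, and values are the tokens
    themselves (identity projections), not learned linear maps.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in tokens:
        # Each token is scored against every token in the sequence.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        output.append([
            sum(w * value[d] for w, value in zip(weights, tokens))
            for d in range(dim)
        ])
    return output


# Toy 4 x 4 image with 3 channels; the values are arbitrary.
image = [[[float(r * 4 + c), 0.0, 1.0] for c in range(4)] for r in range(4)]
tokens = image_to_patch_sequence(image, patch=2)  # 4 tokens of length 2*2*3
attended = global_self_attention(tokens)          # still 4 tokens of length 12
```

The output sequence has the same length as the input sequence, which mirrors the abstract's point that the encoder applies no resolution reduction between layers.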
