The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representations with a full receptive field at every layer across all image patches, in contrast to the progressively growing receptive fields of CNNs across layers and to other alternatives (e.g., large kernels and atrous convolution). In this work, we explore for the first time the global context learning potential of ViTs for dense visual prediction (e.g., semantic segmentation). Our motivation is that by learning global context at a full receptive field layer by layer, ViTs may capture stronger long-range dependencies, which are critical for dense prediction tasks. We first demonstrate that, by encoding an image as a sequence of patches, a vanilla ViT without local convolution and resolution reduction can yield stronger visual representations for semantic segmentation. For example, our model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, first place on the test leaderboard on the day of submission) and Pascal Context (55.83% mIoU), and performs competitively on Cityscapes. To tackle general dense visual prediction tasks in a cost-effective manner, we further formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture. Extensive experiments show that our methods achieve appealing performance on a variety of dense prediction tasks (e.g., object detection, instance segmentation, and semantic segmentation) as well as on image classification. Our code and models are available at https://github.com/fudan-zvg/SETR.
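To make the first design concrete, below is a minimal sketch in PyTorch of the sequence-to-sequence idea: the image is encoded as a flat sequence of patch tokens, processed by a vanilla Transformer encoder that attends over all patches at every layer without ever reducing the token resolution, and decoded by a naive 1x1-classifier-plus-upsampling head. This is an illustrative sketch built from standard `torch.nn` components (the class name `SETRNaiveSketch` and all hyperparameter defaults are assumptions), not the released SETR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SETRNaiveSketch(nn.Module):
    """Vanilla ViT encoder + naive upsampling decoder (illustrative sketch)."""

    def __init__(self, img_size=512, patch=16, dim=768, depth=12, heads=12, n_cls=150):
        super().__init__()
        self.patch = patch
        self.grid = img_size // patch  # e.g. 512 / 16 = 32 tokens per side
        # Patchify: one non-overlapping projection, no local convolution stack.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        # Every layer attends over all patches: full receptive field, no downsampling.
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Conv2d(dim, n_cls, kernel_size=1)  # naive per-token classifier

    def forward(self, x):
        b = x.size(0)
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
        tokens = self.encoder(tokens)          # token resolution is never reduced
        feat = tokens.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        # Upsample per-patch logits back to the input resolution.
        return F.interpolate(self.head(feat), scale_factor=self.patch,
                             mode="bilinear", align_corners=False)

model = SETRNaiveSketch()
logits = model(torch.randn(1, 3, 512, 512))   # -> (1, 150, 512, 512)
```

Because the token grid keeps the same resolution through all encoder layers, every layer models global context directly, which is the property argued above to be critical for dense prediction.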
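Similarly, the following is a minimal sketch of the local-global attention pattern that characterizes HLG: self-attention restricted to non-overlapping windows captures local structure, while per-window summary tokens attend across windows to propagate global context. The class name `LocalGlobalBlockSketch`, the mean-pooled window summaries, and the broadcast residual are illustrative assumptions about one plausible instantiation, not the released HLG code.

```python
import torch
import torch.nn as nn

class LocalGlobalBlockSketch(nn.Module):
    """Local attention within windows + global attention across windows (sketch)."""

    def __init__(self, dim=96, heads=3, window=7):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) with H and W divisible by the window size.
        b, h, w, c = x.shape
        ws = self.window
        # Partition into non-overlapping ws x ws windows: (B * nWin, ws*ws, C).
        win = x.view(b, h // ws, ws, w // ws, ws, c).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, ws * ws, c)
        # Local: self-attention restricted to the tokens inside each window.
        n = self.norm1(win)
        win = win + self.local_attn(n, n, n, need_weights=False)[0]
        # Global: one mean-pooled summary token per window attends across windows.
        g = win.mean(dim=1).view(b, -1, c)
        g = self.global_attn(self.norm2(g), self.norm2(g), self.norm2(g),
                             need_weights=False)[0]
        win = win + g.reshape(-1, 1, c)  # broadcast global context back into windows
        # Undo the window partition: back to (B, H, W, C).
        win = win.view(b, h // ws, w // ws, ws, ws, c).permute(0, 1, 3, 2, 4, 5)
        return win.reshape(b, h, w, c)

block = LocalGlobalBlockSketch()
out = block(torch.randn(2, 28, 28, 96))  # -> (2, 28, 28, 96)
```

The design intent this sketch captures is the cost trade-off stated above: window-local attention keeps the quadratic attention cost bounded by the window size, while the lightweight cross-window attention over summary tokens retains a global receptive field within a pyramidal backbone.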