This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of \cite{dosovitskiy2020image} for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with a manageable computational cost. The second is the attention mechanism of Vision Longformer, a variant of Longformer \cite{beltagy2020longformer} originally developed for natural language processing, which achieves linear complexity w.r.t. the number of input tokens. A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including existing ViT models, their ResNet counterparts, and the Pyramid Vision Transformer from concurrent work \cite{wang2021pyramid}, on a range of vision tasks including image classification, object detection, and segmentation. The models and source code are released at \url{https://github.com/microsoft/vision-longformer}.
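To make the attention mechanism concrete, the following is a minimal PyTorch sketch of Longformer-style attention for vision: each local token attends only to tokens within a $(2w{+}1)\times(2w{+}1)$ spatial window plus a few global tokens, while global tokens attend to everything. All function and variable names here are illustrative, not the released API. For clarity the sketch materializes an explicit boolean mask, which still costs $O(n^2)$ memory; the actual implementation uses sliding-chunk operations to reach the linear complexity stated above.

\begin{verbatim}
import torch

def vision_longformer_mask(h, w, window, n_global):
    """Attention mask for n_global global tokens followed by h*w local tokens."""
    n = n_global + h * w
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_global, :] = True          # global tokens attend to all tokens
    mask[:, :n_global] = True          # all tokens attend to global tokens
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()
    near = (ys[:, None] - ys[None, :]).abs() <= window
    near &= (xs[:, None] - xs[None, :]).abs() <= window
    mask[n_global:, n_global:] = near  # local tokens: 2D sliding-window attention
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 14x14 feature map, window radius 2, one global (CLS-like) token.
h, w, d = 14, 14, 64
n = 1 + h * w
q = k = v = torch.randn(n, d)
mask = vision_longformer_mask(h, w, window=2, n_global=1)
out = masked_attention(q, k, v, mask)
print(out.shape)  # torch.Size([197, 64])
\end{verbatim}

Since each local token attends to at most $(2w{+}1)^2 + n_{\text{global}}$ keys, a window-aware implementation scales linearly in the number of tokens, which is what makes high-resolution inputs affordable.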