Transformers, with their powerful global relation modeling abilities, have recently been introduced to fundamental computer vision tasks. As a typical example, the Vision Transformer (ViT) applies a pure transformer architecture directly to image classification by simply splitting images into a fixed-length sequence of tokens and employing transformers to learn the relations between these tokens. However, such naive tokenization can destroy object structures, assign grids to uninteresting regions such as the background, and introduce interference signals. To mitigate these issues, in this paper we propose an iterative and progressive sampling strategy to locate discriminative regions. At each iteration, the embeddings of the current sampling step are fed into a transformer encoder layer, and a group of sampling offsets is predicted to update the sampling locations for the next step. The progressive sampling is differentiable. When combined with the Vision Transformer, the resulting PS-ViT network can adaptively learn where to look. The proposed PS-ViT is both effective and efficient. When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT with about $4\times$ fewer parameters and $10\times$ fewer FLOPs. Code is available at https://github.com/yuexy/PS-ViT.
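To make the sampling loop concrete, below is a minimal PyTorch sketch of a single progressive sampling iteration as described above: tokens are sampled from a feature map at the current locations, encoded by a transformer layer, and an offset head predicts where to sample next. The class and member names (ProgressiveSamplingStep, offset_head) and the hyper-parameters are illustrative assumptions, not the authors' implementation; see the linked repository for the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveSamplingStep(nn.Module):
    """One progressive sampling iteration (illustrative sketch, not the official code)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                  batch_first=True)
        self.offset_head = nn.Linear(dim, 2)  # predicts a (dx, dy) offset per token

    def forward(self, feature_map: torch.Tensor, locations: torch.Tensor):
        # feature_map: (B, C, H, W) convolutional features of the image.
        # locations:   (B, N, 2) sampling coordinates, normalized to [-1, 1].
        grid = locations.unsqueeze(2)                                  # (B, N, 1, 2)
        # grid_sample is differentiable w.r.t. the grid, so gradients reach
        # the predicted offsets -- this is what makes the sampling trainable.
        tokens = F.grid_sample(feature_map, grid, align_corners=True)  # (B, C, N, 1)
        tokens = tokens.squeeze(-1).transpose(1, 2)                    # (B, N, C)
        tokens = self.encoder(tokens)
        offsets = self.offset_head(tokens)                             # (B, N, 2)
        return tokens, (locations + offsets).clamp(-1, 1)

# Usage: start from a regular grid and refine the locations over T iterations
# (sharing one step module across iterations is an illustrative choice here).
feature_map = torch.randn(2, 256, 14, 14)
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 7),
                        torch.linspace(-1, 1, 7), indexing="ij")
locations = torch.stack([xs, ys], dim=-1).reshape(1, -1, 2).expand(2, -1, -1)
step = ProgressiveSamplingStep()
for _ in range(4):
    tokens, locations = step(feature_map, locations)
```

Because the bilinear sampling is differentiable with respect to the sampling grid, the predicted offsets receive gradients from the classification loss, so the network can learn end-to-end where to place its tokens.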