This work investigates a simple yet powerful dense prediction task adapter for the Vision Transformer (ViT). Unlike recent variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers from inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows a plain ViT to achieve performance comparable to that of vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce image-related inductive biases into the model, making it suitable for these tasks. We verify the ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
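To make the high-level idea concrete, below is a minimal PyTorch sketch of a plain ViT backbone combined with a small convolutional adapter branch that injects spatial, image-related priors into the token sequence. The module names (`SpatialPriorAdapter`, `ViTWithAdapter`), dimensions, and the simple additive fusion are illustrative assumptions and do not reproduce the actual ViT-Adapter architecture; refer to the released code at the repository above for the real implementation.

```python
# Minimal sketch, assuming a plain ViT whose tokens are augmented by a conv-based
# adapter branch. All sizes and the fusion scheme are hypothetical.
import torch
import torch.nn as nn


class PlainViTBlock(nn.Module):
    """Standard pre-norm Transformer encoder block (no vision-specific priors)."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class SpatialPriorAdapter(nn.Module):
    """Hypothetical pre-training-free adapter: a small conv branch whose output
    is added to the ViT tokens to inject local spatial priors."""

    def __init__(self, dim: int, patch: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=patch, stride=patch),  # patch-aligned features
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),       # local spatial mixing
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.conv(img)                    # (B, dim, H/patch, W/patch)
        return feat.flatten(2).transpose(1, 2)   # (B, N, dim) token layout


class ViTWithAdapter(nn.Module):
    def __init__(self, dim: int = 384, depth: int = 4, patch: int = 16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(PlainViTBlock(dim) for _ in range(depth))
        self.adapter = SpatialPriorAdapter(dim, patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)  # plain ViT tokens
        prior = self.adapter(img)                                  # adapter tokens
        x = tokens + prior                                         # simple additive fusion
        for blk in self.blocks:
            x = blk(x)
        return x  # dense features for a detection/segmentation head


if __name__ == "__main__":
    model = ViTWithAdapter()
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 196, 384])
```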