Although pre-trained Vision Transformers (ViTs) have achieved great success in computer vision, adapting a ViT to various image and video tasks is challenging because of its heavy computation and storage burdens: each model needs to be independently and fully fine-tuned for each task, which limits its transferability across domains. To address this challenge, we propose an effective adaptation approach for Transformers, namely AdaptFormer, which can efficiently adapt pre-trained ViTs to many different image and video tasks. It possesses several benefits more appealing than prior arts. First, AdaptFormer introduces lightweight modules that add less than 2% extra parameters to a ViT, yet it increases the ViT's transferability without updating the original pre-trained parameters, significantly outperforming the existing 100% fully fine-tuned models on action recognition benchmarks. Second, it is plug-and-play in different Transformers and scalable to many visual tasks. Third, extensive experiments on five image and video datasets show that AdaptFormer largely improves ViTs in the target domains. For example, when updating just 1.5% extra parameters, it achieves about 10% and 19% relative improvement over the fully fine-tuned models on Something-Something~v2 and HMDB51, respectively. Project page: http://www.shoufachen.com/adaptformer-page.
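The abstract only sketches the mechanism, but the core idea, attaching small trainable modules to a frozen pre-trained ViT so that under 2% of parameters are updated, can be illustrated in a few lines of PyTorch. The sketch below is a minimal illustration under assumptions, not the released AdaptFormer code: the class names (`ParallelAdapter`, `BlockWithAdapter`), the bottleneck width, the scaling factor, and the wrapped block's attribute names (`attn`, `norm1`, `norm2`, `mlp`) are all hypothetical choices made for clarity.

```python
# Minimal sketch (not the authors' implementation) of a lightweight bottleneck
# branch added in parallel to a ViT block's MLP, with the pre-trained backbone frozen.
import torch
import torch.nn as nn


class ParallelAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, scaled before being added back."""

    def __init__(self, dim: int, bottleneck_dim: int = 64, scale: float = 0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, dim)
        self.scale = scale  # illustrative constant; could also be learnable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.up(self.act(self.down(x)))


class BlockWithAdapter(nn.Module):
    """Wrap a pre-trained Transformer block; only the adapter branch is trained."""

    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block                   # frozen pre-trained block
        self.adapter = ParallelAdapter(dim)  # the only new, trainable parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-layer of the pre-trained block (attribute names assumed).
        x = x + self.block.attn(self.block.norm1(x))
        # MLP sub-layer plus the parallel bottleneck branch.
        residual = x
        h = self.block.norm2(x)
        return residual + self.block.mlp(h) + self.adapter(h)


def freeze_backbone(model: nn.Module) -> None:
    """Freeze all pre-trained weights; leave only the adapter parameters trainable."""
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name
```

As a rough budget check, with a hidden size of 768 and a 64-dimensional bottleneck, each adapter adds about 0.1M parameters; inserted into all 12 blocks of a ViT-Base (~86M parameters) this is roughly 1.4% extra, consistent with the under-2% figure quoted in the abstract.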