Towards a Unified View on Visual Parameter-Efficient Transfer Learning

Since the release of various large-scale natural language processing (NLP) pre-trained models, parameter-efficient transfer learning (PETL) has become a popular paradigm capable of achieving impressive performance on various downstream tasks. PETL aims to make good use of the representation knowledge in large pre-trained models by fine-tuning only a small number of parameters. Recently, developing PETL techniques for vision tasks has also attracted increasing attention. Popular PETL techniques such as Prompt-tuning and Adapter have been proposed for high-level visual downstream tasks such as image classification and video recognition. However, Prefix-tuning remains under-explored for vision tasks. In this work, we intend to adapt large video-based models to downstream tasks with a good parameter-accuracy trade-off. Towards this goal, we propose a framework with a unified view called visual-PETL (V-PETL) to investigate the different aspects affecting the trade-off. Specifically, we analyze the positional importance of trainable parameters and the differences between NLP and vision tasks in terms of data structures and pre-training mechanisms while implementing various PETL techniques, especially the under-explored Prefix-tuning technique. Based on a comprehensive understanding of the differences between NLP and video data, we propose a new variation of the Prefix-tuning module called parallel attention (PATT) for video-based downstream tasks. An extensive empirical analysis on two video datasets with different frozen backbones has been carried out, and the findings show that the proposed PATT can effectively contribute to other PETL techniques. An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin with slightly more parameters and outperforms full-tuning with far fewer parameters.
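To make the parameter side of the trade-off concrete, the following minimal sketch counts the weights of one generic Transformer block and of a small bottleneck adapter attached to it. It is an illustration only, not the paper's PATT module or the Swin-BAPAT scheme: the block layout (a standard block with a 4x MLP, not a Swin block), the width `d = 768`, and the bottleneck size `r = 64` are all assumptions chosen for the example.

```python
def transformer_block_params(d: int, mlp_ratio: int = 4) -> int:
    """Weights and biases of one generic Transformer block of width d.

    Assumed layout (not taken from the paper): four d x d attention
    projections, a two-layer MLP with hidden size mlp_ratio * d, and
    two LayerNorms.
    """
    attention = 4 * (d * d + d)                      # Q, K, V, output projections
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)   # two linear layers with biases
    layer_norms = 2 * (2 * d)                        # scale and shift, two norms
    return attention + mlp + layer_norms


def bottleneck_adapter_params(d: int, r: int) -> int:
    """Down-projection d -> r plus up-projection r -> d, both with biases."""
    return (d * r + r) + (r * d + d)


if __name__ == "__main__":
    d, r = 768, 64  # hypothetical sizes, chosen only for illustration
    frozen = transformer_block_params(d)
    trainable = bottleneck_adapter_params(d, r)
    print(f"frozen block parameters:      {frozen}")
    print(f"trainable adapter parameters: {trainable}")
    print(f"trainable fraction:           {trainable / frozen:.2%}")
```

Under these assumptions the trainable module is a small fraction of the block it is attached to, which is the sense in which PETL methods tune "a small number of parameters" while the backbone stays frozen.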
