In this paper, we efficiently transfer the superior representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters. Previous adaptation methods have considered spatial and temporal modeling simultaneously with a unified learnable module, but they still fall short of fully leveraging the representative capabilities of image transformers. We argue that the popular dual-path (two-stream) architecture in video models can mitigate this problem. We propose a novel DualPath adaptation that is separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block. In particular, for temporal dynamic modeling, we incorporate consecutive frames into a grid-like frameset to precisely imitate vision transformers' capability of extrapolating relationships between tokens. In addition, we extensively investigate multiple baselines from a unified perspective on video understanding and compare them with DualPath. Experimental results on four action recognition benchmarks show that pretrained image transformers with DualPath generalize effectively beyond their original data domain.
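A minimal sketch of the two components described above, assuming a PyTorch-style implementation; the module names, dimensions, and grid layout are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (assumption: PyTorch; names and dims are illustrative, not the official code).
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Lightweight adapter inserted into each frozen transformer block."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down to the bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back up
        nn.init.zeros_(self.up.weight)           # start as an identity (residual) mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual adapter


def to_grid_frameset(frames: torch.Tensor, rows: int = 2, cols: int = 2) -> torch.Tensor:
    """Tile consecutive frames into one grid-like image for the temporal path.

    frames: (B, T, C, H, W) with T == rows * cols  ->  (B, C, rows*H, cols*W)
    """
    b, t, c, h, w = frames.shape
    assert t == rows * cols, "number of frames must fill the grid"
    grid = frames.view(b, rows, cols, c, h, w)
    grid = grid.permute(0, 3, 1, 4, 2, 5).reshape(b, c, rows * h, cols * w)
    return grid


if __name__ == "__main__":
    clip = torch.randn(2, 4, 3, 112, 112)        # 4 consecutive frames per clip
    grid = to_grid_frameset(clip)                # (2, 3, 224, 224), a ViT-sized input
    tokens = torch.randn(2, 197, 768)            # token sequence inside a ViT block
    adapted = BottleneckAdapter(dim=768)(tokens) # same shape, lightly adapted
    print(grid.shape, adapted.shape)
```

The grid frameset lets the frozen image transformer attend across frames with its ordinary spatial self-attention, while the zero-initialized adapters leave the pretrained representation untouched at the start of tuning.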