The computer vision community has seen a shift from convolution-based to pure transformer architectures for both image and video tasks. Training a transformer from scratch for these tasks usually requires large amounts of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification that achieves state-of-the-art accuracy and efficiency on several datasets. In this paper, we aim to understand whether VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something, using a transfer-learning approach from Kinetics-400 that requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show 85\% top-1 accuracy on FCVID without retraining the whole model, which matches the state of the art for that dataset, and 21\% accuracy on Something-Something. The experiments also suggest that the performance of VST decreases on average as video duration increases, which seems to be a consequence of a design choice of the model. From these results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are of the same type as the classes used to train the model; we observed this effect when transferring from Kinetics-400 to FCVID, where the classes of both datasets mostly represent objects. Conversely, if the classes are not of the same type, the accuracy after transfer learning is expected to be poor; we observed this effect when transferring from Kinetics-400, whose classes mostly represent objects, to Something-Something, whose classes mostly represent actions.