Transfer learning is a cornerstone for a wide range of computer vision problems. It has been broadly studied for image analysis tasks; however, the literature for video analysis is scarce and has mainly focused on transferring representations learned from ImageNet to human action recognition tasks. In this paper, we study transfer learning for Multi-label Movie Trailer Genre Classification (MTGC). In particular, we introduce Trailers12k, a new manually curated movie trailer dataset, and evaluate the transferability of spatial and spatio-temporal representations learned from ImageNet and/or Kinetics to Trailers12k MTGC. To reduce the spatio-temporal structure gap between the source and target tasks and improve transferability, we propose a method that performs shot detection to segment the trailer into highly correlated clips. We study different aspects that influence transferability, such as the segmentation strategy, frame rate, input video extension, and spatio-temporal modeling. Our results demonstrate that representations learned on either ImageNet or Kinetics are comparably transferable to Trailers12k, and that they provide complementary information that can be combined to improve classification performance. For a similar number of parameters and FLOPs, Transformers provide a better transferability base than ConvNets. Nevertheless, competitive performance can be achieved using lightweight ConvNets, making them an attractive option for low-resource environments.