In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. To reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose the Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparably transferable to Trailers12k. Moreover, both datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain compared to the top single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer, while requiring only 11.82% of its parameters and 0.81% of its FLOPS.
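To make the shot-based segmentation idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: it detects shot boundaries with a simple frame-difference heuristic and groups frames into shot-coherent clips, the kind of input DIViTA feeds to a pretrained backbone. The frame representation, `threshold` value, and function names here are hypothetical.

```python
# Hedged sketch of shot-based clip segmentation (assumed heuristic,
# not DIViTA's actual shot-detection algorithm).

def frame_diff(a, b):
    """Mean absolute pixel difference between two grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def detect_shot_boundaries(frames, threshold=30.0):
    """Indices where a new shot starts (large inter-frame change)."""
    return [i for i in range(1, len(frames))
            if frame_diff(frames[i - 1], frames[i]) > threshold]

def segment_into_clips(frames, threshold=30.0):
    """Group frames into highly correlated clips, one per detected shot."""
    boundaries = detect_shot_boundaries(frames, threshold)
    clips, start = [], 0
    for b in boundaries:
        clips.append(frames[start:b])
        start = b
    clips.append(frames[start:])
    return clips

# Toy example: two "shots" of constant brightness 10 and 200.
frames = [[10] * 16] * 4 + [[200] * 16] * 4
clips = segment_into_clips(frames)
# → two clips of four frames each
```

Each resulting clip is internally cohesive, so a backbone pretrained on ImageNet or Kinetics sees input whose spatio-temporal statistics are closer to its pretraining data.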