In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video duration, and spatio-temporal modeling. In order to reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose the Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparably transferable to Trailers12k. Moreover, the two datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain over the best single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer, while requiring only 11.82% of its parameters and 0.81% of its FLOPs.
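To make the clip segmentation step concrete, below is a minimal, illustrative sketch (not the authors' implementation) of shot-based clip extraction: shot boundaries are detected by thresholding color-histogram differences between consecutive frames, and fixed-length clips are then drawn so that they never straddle a cut, producing the highly correlated clips that a pretrained backbone would consume. The function names, the histogram-difference heuristic, the threshold value, and the assumed `frames` array layout (T, H, W, 3) are all hypothetical choices made for this example.

```python
# Illustrative sketch of shot-based clip segmentation (assumptions noted above).
import numpy as np

def detect_shot_boundaries(frames: np.ndarray, threshold: float = 0.5) -> list[int]:
    """Return frame indices where a new shot is assumed to start."""
    boundaries = [0]
    prev_hist = None
    for t, frame in enumerate(frames):
        # Coarse per-channel color histogram, normalized to sum to 1.
        hist = np.concatenate(
            [np.histogram(frame[..., c], bins=16, range=(0, 255))[0] for c in range(3)]
        ).astype(np.float64)
        hist /= hist.sum()
        # A spike in the L1 distance between consecutive histograms marks a cut.
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(t)
        prev_hist = hist
    return boundaries

def segment_into_clips(frames: np.ndarray, clip_len: int = 16) -> list[np.ndarray]:
    """Group frames into fixed-length clips that never cross a detected cut."""
    cuts = detect_shot_boundaries(frames) + [len(frames)]
    clips = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        for s in range(start, end, clip_len):
            clips.append(frames[s : min(s + clip_len, end)])
    return clips
```

In practice, the histogram heuristic would be replaced by a dedicated shot-boundary detector; the point of the sketch is only that clips respect shot boundaries, so each clip is internally coherent before being passed to the pretrained backbones.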