In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. To reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose the Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparably transferable to Trailers12k. Moreover, both datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain compared to the top single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer, while requiring only 11.82% of its parameters and 0.81% of its FLOPS.
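To make the shot-based segmentation idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: it detects shot boundaries with a simple frame-difference heuristic and groups frames into shot-coherent clips, the kind of input DIViTA feeds to a pretrained backbone. The frame representation, `threshold` value, and function names here are hypothetical.

```python
# Hedged sketch of shot-based clip segmentation (assumed heuristic,
# not DIViTA's actual shot-detection algorithm).

def frame_diff(a, b):
    """Mean absolute pixel difference between two grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def detect_shot_boundaries(frames, threshold=30.0):
    """Indices where a new shot starts (large inter-frame change)."""
    return [i for i in range(1, len(frames))
            if frame_diff(frames[i - 1], frames[i]) > threshold]

def segment_into_clips(frames, threshold=30.0):
    """Group frames into highly correlated clips, one per detected shot."""
    boundaries = detect_shot_boundaries(frames, threshold)
    clips, start = [], 0
    for b in boundaries:
        clips.append(frames[start:b])
        start = b
    clips.append(frames[start:])
    return clips

# Toy example: two "shots" of constant brightness 10 and 200.
frames = [[10] * 16] * 4 + [[200] * 16] * 4
clips = segment_into_clips(frames)
# → two clips of four frames each
```

Each resulting clip is internally cohesive, so a backbone pretrained on ImageNet or Kinetics sees input whose spatio-temporal statistics are closer to its pretraining data.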