Automatic understanding of movie scenes is an important problem with multiple downstream applications including video moderation, search, and recommendation. The long-form nature of movies makes labeling movie scenes a laborious task, so applying end-to-end supervised approaches to movie-scene understanding is challenging. Directly applying state-of-the-art visual representations learned from large-scale image datasets to movie-scene understanding also proves ineffective, given the large gap between the two domains. To address these challenges, we propose a novel contrastive learning approach that uses commonly available sources of movie information (e.g., genre, synopsis, more-like-this information) to learn a general-purpose scene representation. Using a new dataset (MovieCL30K) with 30,340 movies, we demonstrate that our learned scene representation surpasses existing state-of-the-art results on eleven downstream tasks from multiple datasets. To further show the effectiveness of our scene representation, we introduce another new dataset (MCD) focused on large-scale video moderation, with 44,581 clips containing sex, violence, and drug-use activities covering 18,330 movies and TV episodes, and show strong gains over existing state-of-the-art approaches.
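The abstract does not spell out the training objective, but a metadata-driven contrastive setup can be sketched: scenes drawn from movies that share metadata (genre, synopsis similarity, more-like-this links) are treated as positives, and all other scenes in the batch as negatives, under a standard InfoNCE loss. The function name, temperature value, and pairing rule below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def metadata_contrastive_loss(scene_emb: torch.Tensor,
                              pos_emb: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss over a batch of scene embeddings.

    scene_emb[i] and pos_emb[i] are embeddings of two scenes whose movies
    share metadata (an assumed pairing rule); every other pair in the
    batch acts as a negative.
    """
    z1 = F.normalize(scene_emb, dim=1)   # (B, D) anchor scene embeddings
    z2 = F.normalize(pos_emb, dim=1)     # (B, D) metadata-matched positives
    logits = z1 @ z2.t() / temperature   # (B, B) cosine-similarity logits
    # The diagonal entries are the positive pairs.
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Minimal usage with random embeddings standing in for a scene encoder:
loss = metadata_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```

Using metadata to define positive pairs, rather than augmented views of the same clip, is what lets the representation capture movie-level semantics without per-scene labels.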