Longform media such as movies have complex narrative structures, with events spanning a rich variety of ambient visual scenes. Domain-specific challenges associated with visual scenes in movies include transitions, person coverage, and a wide array of real-life and fictional scenarios. Existing movie visual scene datasets have limited taxonomies and do not consider visual scene transitions within movie clips. In this work, we address the problem of visual scene recognition in movies by first automatically curating a new and extensive movie-centric taxonomy of 179 scene labels derived from movie scripts and auxiliary web-based video datasets. Instead of relying on manual annotations, which can be expensive, we use CLIP to weakly label 1.12 million shots from 32K movie clips based on our proposed taxonomy. We call the resulting weakly labeled dataset MovieCLIP, train baseline visual models on it, and evaluate them on an independent dataset verified by human raters. We show that leveraging features from models pretrained on MovieCLIP benefits downstream tasks such as multi-label scene and genre classification of web videos and movie trailers.
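To make the weak-labeling step concrete, below is a minimal sketch of zero-shot scene labeling of a shot keyframe with CLIP, using the Hugging Face transformers API. The prompt template, the example label subset, the one-keyframe-per-shot sampling, and the helper name `weak_label_shot` are assumptions for illustration; the abstract does not specify the authors' exact pipeline.

```python
# Minimal sketch of CLIP-based weak labeling of shot keyframes.
# Assumptions (not from the paper): one representative frame per shot,
# the "a photo of a {label}" prompt template, and this toy label subset.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical subset of the 179-label movie-centric scene taxonomy.
SCENE_LABELS = ["airport", "bar", "courtroom", "forest", "spaceship"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def weak_label_shot(frame_path: str, labels=SCENE_LABELS, top_k: int = 1):
    """Return the top-k scene labels and scores for one shot keyframe."""
    image = Image.open(frame_path).convert("RGB")
    prompts = [f"a photo of a {label}" for label in labels]  # assumed template
    inputs = processor(
        text=prompts, images=image, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        # logits_per_image: image-text similarity, shape (1, num_labels)
        logits = model(**inputs).logits_per_image
    probs = logits.softmax(dim=-1).squeeze(0)
    scores, idx = probs.topk(top_k)
    return [(labels[i], s.item()) for i, s in zip(idx.tolist(), scores)]

# Example usage: weak_label_shot("shot_000123_keyframe.jpg")
```

Applied shot by shot over segmented movie clips, this kind of zero-shot scoring yields weak scene labels at scale without manual annotation, which is the trade-off the dataset construction relies on.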