Automatically identifying harmful content in video is an important task with a wide range of applications. However, due to the difficulty of collecting high-quality labels as well as demanding computational requirements, no fully general approach to the task exists yet. Typically, only small subsets of the problem are considered, such as identifying violent content. In cases where the general problem is tackled, approximations and simplifications are made to cope with the lack of labels and the computational complexity. In this work, we identify and tackle some of the main obstacles. First, we create an open dataset of 3589 video clips from film trailers, annotated by professionals in the field. Second, we analyze the constructed dataset, investigating among other things the relation between clip-level and trailer-level annotations. Lastly, we train audiovisual models on our dataset and conduct an in-depth study of our modeling choices. We find that results improve greatly by combining the visual and audio modalities, and that pre-training on large-scale video recognition datasets as well as class-balanced sampling further improve performance. Further details of our dataset are available at this webpage: https://vidharm.github.io/.
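One of the modeling choices the abstract highlights is class-balanced sampling. As a minimal illustrative sketch, the PyTorch snippet below shows one common way to realize it: weighting each training example inversely to its class frequency so that rare harm-rating classes are drawn as often as frequent ones. The class counts, feature dimensions, and label scheme here are hypothetical placeholders, not the paper's actual setup.

```python
# Sketch of class-balanced sampling via inverse-frequency sample weights.
# Labels, counts, and features are hypothetical stand-ins for annotated clips.
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical imbalanced labels for 3589 clips (4 illustrative rating classes).
labels = torch.tensor([0] * 2000 + [1] * 900 + [2] * 500 + [3] * 189)
features = torch.randn(len(labels), 16)  # placeholder per-clip features
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse of its class count so every class
# is sampled with roughly equal probability during training.
counts = Counter(labels.tolist())
weights = torch.tensor([1.0 / counts[int(y)] for y in labels], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

loader = DataLoader(dataset, batch_size=32, sampler=sampler)
batch_features, batch_labels = next(iter(loader))  # batches are roughly class-balanced
```

Sampling with replacement is the usual choice here, since minority-class examples must be reused to fill out balanced batches over an epoch.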