Automatically identifying harmful content in video is an important task with a wide range of applications. However, due to the difficulty of collecting high-quality labels and the demanding computational requirements, the task has lacked a satisfying general approach. Typically, only narrow subsets of the problem are considered, such as identifying violent content. In the cases where the general problem is tackled, rough approximations and simplifications are made to cope with the lack of labels and the computational complexity. In this work, we identify and tackle these two main obstacles. First, we create a dataset of approximately 4000 video clips, annotated by professionals in the field. Second, we demonstrate that advances in video recognition enable training models on our dataset that consider the full context of the scene. We conduct an in-depth study of our modeling choices and find that combining the visual and audio modalities yields large gains, and that pretraining on large-scale video recognition datasets and class-balanced sampling further improve performance. We additionally perform a qualitative study that reveals the heavily multi-modal nature of our dataset. Our dataset will be made available upon publication.
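To make the class-balanced sampling mentioned above concrete, the following is a minimal sketch of one common way to implement it, assuming a PyTorch training pipeline; the label array, class layout, and feature tensors are illustrative assumptions, not the paper's actual data or implementation.

    # Minimal sketch of class-balanced sampling via PyTorch's WeightedRandomSampler.
    # Labels and features below are hypothetical stand-ins for the clip dataset.
    from collections import Counter

    import torch
    from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

    # Hypothetical integer class labels for a small, imbalanced set of clips.
    labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2])  # class 0 dominates

    # Weight each sample by the inverse frequency of its class, so each class
    # is drawn with roughly equal probability when batches are sampled.
    counts = Counter(labels.tolist())
    weights = torch.tensor([1.0 / counts[int(y)] for y in labels], dtype=torch.double)

    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

    # Stand-in features; in practice these would be video clips or clip embeddings.
    features = torch.randn(len(labels), 8)
    loader = DataLoader(TensorDataset(features, labels), batch_size=4, sampler=sampler)

    for x, y in loader:
        print(y.tolist())  # batches now mix the rare classes in far more evenly

The design intuition is that inverse-frequency weighting counteracts the long-tailed label distribution typical of harmful-content data, where benign or common categories would otherwise dominate every batch.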