Understanding movies and their structural patterns is a crucial task in decoding the craft of video editing. While previous works have developed tools for general analysis, such as detecting characters or recognizing cinematography properties at the shot level, less effort has been devoted to understanding the most basic video edit, the Cut. This paper introduces the Cut type recognition task, which requires modeling multi-modal information. To ignite research on this new task, we construct a large-scale dataset called MovieCuts, which contains 173,967 video clips labeled with ten cut types defined by professionals in the movie industry. We benchmark a set of audio-visual approaches, including some that address the problem's multi-modal nature. Our best model achieves 47.7% mAP, which suggests that the task is challenging and that highly accurate Cut type recognition remains an open research problem. Advances in automatic Cut type recognition can enable new experiences in the video editing industry, such as movie analysis for education, video re-editing, virtual cinematography, machine-assisted trailer generation, and machine-assisted video editing. Our data and code are publicly available: https://github.com/PardoAlejo/MovieCuts.