AutoTransition: Learning to Recommend Video Transition Effects

Video transition effects are widely used in video editing to connect shots and create cohesive, visually appealing videos. However, non-professionals find it challenging to choose the best transitions because they lack cinematographic knowledge and design skills. In this paper, we present the first work on automatic video transition recommendation (VTR): given a sequence of raw video shots and companion audio, recommend video transitions for each pair of neighboring shots. To solve this task, we collect a large-scale video transition dataset from publicly available video templates in editing software. We then formulate VTR as a multi-modal retrieval problem from vision/audio to video transitions and propose a novel multi-modal matching framework consisting of two parts. First, we learn embeddings of video transitions through a video transition classification task. Second, we propose a model that learns the matching correspondence from vision/audio inputs to video transitions. Specifically, the proposed model employs a multi-modal transformer to fuse vision and audio information and to capture context cues in the sequential transition outputs. Quantitative and qualitative experiments demonstrate the effectiveness of our method. Notably, in a comprehensive user study, our method receives scores comparable to those of professional editors while improving video editing efficiency by \textbf{300\scalebox{1.25}{$\times$}}. We hope our work inspires other researchers to work on this new task. The dataset and code are publicly available at \url{https://github.com/acherstyx/AutoTransition}.
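To make the retrieval formulation concrete, the following is a minimal sketch of the matching step only, not the released implementation. It assumes that each shot boundary has already been encoded into a query vector (in the paper this would come from the multi-modal transformer, which is omitted here) and that each transition has a learned embedding. The function names, the toy 3-dimensional vectors, and the use of cosine similarity as the matching score are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recommend_transitions(boundary_queries, transition_embeddings, top_k=1):
    """Rank transitions for each shot boundary by similarity to its query.

    boundary_queries: list of query vectors, one per pair of neighboring
        shots (assumed output of a fused vision/audio encoder, not shown).
    transition_embeddings: dict mapping transition name -> embedding vector
        (assumed output of a pretrained transition classifier).
    Returns one list of the top_k transition names per boundary.
    """
    results = []
    for query in boundary_queries:
        ranked = sorted(
            transition_embeddings,
            key=lambda name: cosine(query, transition_embeddings[name]),
            reverse=True,
        )
        results.append(ranked[:top_k])
    return results


# Hypothetical transition names with hand-made embeddings, for illustration.
TRANSITIONS = {
    "fade": [1.0, 0.0, 0.0],
    "slide_left": [0.0, 1.0, 0.0],
    "zoom_in": [0.0, 0.0, 1.0],
}

# Hypothetical query vectors for two shot boundaries.
QUERIES = [
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.8],
]

if __name__ == "__main__":
    print(recommend_transitions(QUERIES, TRANSITIONS, top_k=2))
```

Because the transition embeddings are learned separately from the matching model, a scheme like this can score any transition that has an embedding without changing the query encoder.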
