Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and the crucial information conveyed by advertisers. It mainly consists of two stages: video segmentation and segment assemblage. The existing method performs well at the video segmentation stage but suffers from dependence on extra cumbersome models and poor performance at the segment assemblage stage. To address these problems, we propose M-SAN (Multi-modal Segment Assemblage Network), which performs efficient and coherent segment assemblage end-to-end. It utilizes multi-modal representations extracted from the segments and follows the Encoder-Decoder Ptr-Net framework with an attention mechanism. An importance-coherence reward is designed for training M-SAN. We experiment on the Ads-1k dataset, which contains 1000+ videos under rich ad scenarios collected from advertisers. To evaluate the methods, we propose a unified metric, Imp-Coh@Time, which comprehensively assesses the importance, coherence, and duration of the outputs at the same time. Experimental results show that our method outperforms both random selection and the previous method on this metric. Ablation experiments further verify that the multi-modal representations and the importance-coherence reward significantly improve performance. The Ads-1k dataset is available at: https://github.com/yunlong10/Ads-1k
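To make the segment-assemblage formulation concrete, the following is a minimal NumPy sketch of the standard Ptr-Net pointing step (Vinyals et al.) applied to segment selection under a duration budget. All names (`pointer_probs`, `greedy_assemble`), the weight shapes, and the greedy feed-back of the chosen segment embedding as the next decoder state are illustrative assumptions, not M-SAN's actual architecture or training procedure.

```python
import numpy as np

def pointer_probs(enc, dec, W1, W2, v, mask):
    """Ptr-Net attention: u_j = v^T tanh(W1 e_j + W2 d), softmax over
    unselected segments. enc is (n, d); dec is (d,); mask flags picked segments."""
    scores = np.tanh(enc @ W1 + dec @ W2) @ v
    scores[mask] = -np.inf  # already-selected segments cannot be pointed at again
    exp = np.exp(scores - scores[~mask].max())
    return exp / exp.sum()

def greedy_assemble(enc, dec, W1, W2, v, durations, budget):
    """Greedily point at the most probable remaining segment until adding
    another segment would exceed the duration budget."""
    n = len(enc)
    mask = np.zeros(n, dtype=bool)
    order, total = [], 0.0
    for _ in range(n):
        probs = pointer_probs(enc, dec, W1, W2, v, mask)
        j = int(np.argmax(probs))
        if total + durations[j] > budget:
            break
        order.append(j)
        total += durations[j]
        mask[j] = True
        dec = enc[j]  # hypothetical: reuse the chosen embedding as the next query
    return order, total
```

At training time the greedy argmax would be replaced by sampling, with the importance-coherence reward driving policy-gradient updates; the sketch above only shows the decoding geometry.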