Chapter generation has become a practical technique for online videos. Chapter breakpoints enable users to quickly find the parts they want and obtain summative annotations. However, there is no public method or dataset for this task. To facilitate research in this direction, we introduce a new dataset called Chapter-Gen, which consists of approximately 10k user-generated videos with annotated chapter information. Our data collection procedure is fast, scalable, and does not require any additional manual annotation. On top of this dataset, we design an effective baseline specifically for the video chapter generation task, which captures two aspects of a video: visual dynamics and narration text. It disentangles local and global video features for localization and title generation, respectively. To parse long videos efficiently, a skip sliding window mechanism is designed to localize potential chapters, and a cross-attention multi-modal fusion module is developed to aggregate local features for title generation. Our experiments demonstrate that the proposed framework achieves superior results over existing methods, which illustrates that method designs for similar tasks cannot be transferred directly, even after fine-tuning. Code and dataset are available at https://github.com/czt117/MVCG.
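To make the two components named above concrete, the following is a minimal sketch, not the authors' released code: a skip sliding window that scans a long sequence of clip features for candidate chapter boundaries, and a cross-attention module that fuses narration tokens with local visual features before title decoding. All class names, dimensions, and the window/skip sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SkipSlidingWindowLocalizer(nn.Module):
    """Scores candidate chapter boundaries inside windows taken at a fixed stride (skip).
    Hypothetical sketch; window/skip sizes and the scorer head are assumptions."""

    def __init__(self, dim=512, window=32, skip=16):
        super().__init__()
        self.window, self.skip = window, skip
        self.scorer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, clip_feats):  # clip_feats: (T, dim) per-clip visual features
        T = clip_feats.size(0)
        scores = torch.full((T,), float("-inf"))
        for start in range(0, max(T - self.window, 0) + 1, self.skip):
            window = clip_feats[start:start + self.window]       # local features only
            local = self.scorer(window).squeeze(-1)              # boundary score per clip
            scores[start:start + window.size(0)] = torch.maximum(
                scores[start:start + window.size(0)], local)
        return scores  # higher score = more likely chapter boundary


class CrossAttentionFusion(nn.Module):
    """Aggregates local visual features conditioned on narration tokens (queries)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, visual_tokens):  # (B, Lt, dim), (B, Lv, dim)
        fused, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return fused  # would be fed to a title decoder in the full model


if __name__ == "__main__":
    feats = torch.randn(200, 512)  # a 200-clip video
    print(SkipSlidingWindowLocalizer()(feats).shape)  # torch.Size([200])
    fused = CrossAttentionFusion()(torch.randn(1, 12, 512), torch.randn(1, 32, 512))
    print(fused.shape)  # torch.Size([1, 12, 512])
```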