Dense video captioning (DVC) aims to generate multi-sentence descriptions that elucidate the multiple events in a video, a challenging task that demands visual consistency, discourse coherence, and linguistic diversity. Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context and progressive alignment between the fast-evolving visual content and the textual descriptions, which results in redundant and spliced descriptions. In this paper, we introduce the concept of information flow to model the progressive information change across the video sequence and its captions. By designing a Cross-modal Information Flow Alignment mechanism, the visual and textual information flows are captured and aligned, which endows the captioning process with richer context and dynamics on event/topic evolution. Based on the Cross-modal Information Flow Alignment module, we further propose the DVCFlow framework, which consists of a Global-local Visual Encoder that captures both global features and local features for each video segment, and a pre-trained Caption Generator that produces the captions. Extensive experiments on the popular ActivityNet Captions and YouCookII datasets demonstrate that our method significantly outperforms competitive baselines and generates more human-like text according to both subjective and objective tests.
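To make the information-flow idea concrete, below is a minimal PyTorch sketch of one plausible reading of cross-modal information flow alignment. The difference-based definition of flow (the change between consecutive segment embeddings), the cosine alignment loss, and all function names are illustrative assumptions on our part, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def information_flow(segment_embs: torch.Tensor) -> torch.Tensor:
    """Model information flow as the change between consecutive segment
    embeddings (a hypothetical simplification of the paper's notion).

    segment_embs: (num_segments, dim) per-segment embeddings.
    Returns: (num_segments - 1, dim) flow vectors.
    """
    return segment_embs[1:] - segment_embs[:-1]

def flow_alignment_loss(visual_embs: torch.Tensor,
                        text_embs: torch.Tensor) -> torch.Tensor:
    """Align visual and textual information flows.

    Both inputs are (num_segments, dim) and assumed to already live in a
    shared embedding space. Uses cosine distance between corresponding
    flow vectors; the actual alignment objective may differ.
    """
    v_flow = information_flow(visual_embs)
    t_flow = information_flow(text_embs)
    cos = F.cosine_similarity(v_flow, t_flow, dim=-1)
    return (1.0 - cos).mean()

# Usage: 5 video segments, 512-dim shared embedding space.
v = torch.randn(5, 512)  # per-segment visual features (e.g., from the encoder)
t = torch.randn(5, 512)  # per-segment caption features (e.g., from the generator)
loss = flow_alignment_loss(v, t)  # scalar, added to the captioning loss
```

Operating on flow vectors rather than raw segment embeddings would encourage the captions to track how the video content evolves between events, rather than matching each segment in isolation, which is consistent with the paper's stated motivation.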