Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of the video content as well as contextual reasoning about individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then generating captions for a subset of the proposals. As a result, the generated sentences are prone to being redundant or inconsistent since they fail to consider temporal dependency between events. To tackle this challenge, we propose a novel dense video captioning framework, which explicitly models temporal dependency across events in a video and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network to adaptively select a sequence of event proposals, and 2) feeding the sequence of event proposals to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards, at both the event and episode levels, for better context modeling. The proposed technique achieves outstanding performance on the ActivityNet Captions dataset in most metrics.
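The sketch below is a minimal, illustrative rendering (not the authors' implementation) of the two-stage pipeline described above, written with PyTorch: an event sequence selection module picks an ordered subset of candidate proposals, a sequential captioner carries hidden context from one event's caption to the next, and a two-level reward combines event-level and episode-level scores for a REINFORCE-style update. All class names (`EventSequenceSelector`, `SequentialCaptioner`, `two_level_reward`), dimensions, the greedy top-k selection, and the reward weighting are assumptions made for illustration; the paper's actual networks and reward definitions are more elaborate.

```python
import torch
import torch.nn as nn


class EventSequenceSelector(nn.Module):
    """Scores candidate event proposals and keeps an ordered subset (sketch)."""

    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, proposal_feats: torch.Tensor, max_events: int = 3):
        # proposal_feats: (num_proposals, feat_dim) pooled features per proposal.
        h, _ = self.rnn(proposal_feats.unsqueeze(0))        # (1, N, hidden)
        logits = self.score(h).squeeze(-1).squeeze(0)       # (N,)
        # Greedy stand-in for adaptive selection: keep the top-scoring proposals,
        # then reorder them temporally to form the event sequence.
        keep = torch.argsort(logits, descending=True)[:max_events]
        return torch.sort(keep).values, logits


class SequentialCaptioner(nn.Module):
    """Captions each selected event, carrying hidden state from the previous
    event so linguistic context flows across the episode (sketch)."""

    def __init__(self, feat_dim: int, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.inp = nn.Linear(feat_dim, hidden)
        self.cell = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, event_feats: torch.Tensor, max_len: int = 10):
        captions = []
        context = torch.zeros(1, self.cell.hidden_size)     # prior-event context
        for feat in event_feats:                             # one event at a time
            visual = self.inp(feat).unsqueeze(0)             # (1, hidden)
            h = context                                      # start from prior event
            tokens = []
            for _ in range(max_len):                         # greedy decoding
                h = self.cell(visual, h)
                tokens.append(self.out(h).argmax(dim=-1))
            context = h                                      # pass context forward
            captions.append(torch.stack(tokens, dim=1))      # (1, max_len) token ids
        return captions


def two_level_reward(event_scores: list, episode_score: float,
                     w_event: float = 0.5, w_episode: float = 0.5) -> float:
    """Hypothetical combination of per-event rewards (e.g. per-sentence caption
    quality) and an episode-level reward for the whole multi-sentence story."""
    mean_event = sum(event_scores) / max(len(event_scores), 1)
    return w_event * mean_event + w_episode * episode_score


# Example: select events from 10 candidate proposals and caption them in order.
selector = EventSequenceSelector(feat_dim=512)
captioner = SequentialCaptioner(feat_dim=512, vocab_size=1000)
proposals = torch.randn(10, 512)
event_idx, _ = selector(proposals)
captions = captioner(proposals[event_idx])
```

In the actual framework, the greedy top-k step would be replaced by the learned event sequence generation network, and the decoded captions would be scored by the event- and episode-level rewards to drive the reinforcement learning update.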