Dense video captioning aims to generate multiple associated captions with their temporal locations from a video. Previous methods follow a sophisticated "localize-then-describe" scheme, which relies heavily on numerous hand-crafted components. In this paper, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), formulating dense caption generation as a set prediction task. In practice, by stacking a newly proposed event counter on top of a transformer decoder, PDVC precisely segments the video into a number of event pieces under a holistic understanding of the video content, which effectively increases the coherence and readability of the predicted captions. Compared with prior art, PDVC has several appealing advantages: (1) without relying on heuristic non-maximum suppression or a recurrent event sequence selection network to remove redundancy, PDVC directly produces an event set of an appropriate size; (2) in contrast to the two-stage scheme, we feed the enhanced representations of event queries into the localization head and the caption head in parallel, making these two sub-tasks deeply interrelated and mutually promoted during optimization; (3) without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results, surpassing state-of-the-art two-stage methods when its localization accuracy is on par with them. Code is available at https://github.com/ttengwang/PDVC.
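To make the parallel-decoding idea concrete, below is a minimal PyTorch sketch, not the authors' implementation (see the repository above for the real code). It refines a fixed set of learned event queries with a transformer decoder, then feeds the refined queries in parallel to a localization head, a caption head, and an event counter. All sizes and names here (`num_queries`, `hidden_dim`, `max_events`) are illustrative assumptions, and the caption head is reduced to a single linear projection as a stand-in for the per-event captioning module.

```python
# Minimal PDVC-style parallel decoding sketch (illustrative, not the paper's code).
import torch
import torch.nn as nn

class ParallelDecodingHeads(nn.Module):
    def __init__(self, hidden_dim=512, num_queries=100, vocab_size=10000,
                 max_events=10, num_layers=2):
        super().__init__()
        # One learned query per candidate event (set prediction).
        self.event_queries = nn.Embedding(num_queries, hidden_dim)
        layer = nn.TransformerDecoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Localization head: (center, length) of each event, normalized to [0, 1].
        self.loc_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2), nn.Sigmoid())
        # Caption head: stand-in for the per-event captioner (assumption).
        self.cap_head = nn.Linear(hidden_dim, vocab_size)
        # Event counter: classifies how many events the video contains,
        # so an event set of appropriate size can be selected.
        self.counter = nn.Linear(hidden_dim, max_events + 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, hidden_dim) from the video encoder.
        b = frame_feats.size(0)
        queries = self.event_queries.weight.unsqueeze(0).expand(b, -1, -1)
        hs = self.decoder(queries, frame_feats)       # refined event queries
        boxes = self.loc_head(hs)                     # (b, num_queries, 2)
        cap_logits = self.cap_head(hs)                # (b, num_queries, vocab)
        count_logits = self.counter(hs.max(dim=1).values)  # (b, max_events + 1)
        return boxes, cap_logits, count_logits
```

In this sketch, both heads consume the same query representations, so localization and captioning are optimized jointly rather than in two stages; at inference, the counter's argmax would give the predicted event number N, and the N most confident queries would form the final event set.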