Dense video captioning aims to generate corresponding text descriptions for a series of events in an untrimmed video, and it can be divided into two sub-tasks: event detection and event captioning. Unlike previous works that tackle the two sub-tasks separately, recent works have focused on enhancing the inter-task association between them. However, designing inter-task interactions for event detection and captioning is not trivial due to the large differences in their task-specific solutions. Moreover, previous event detection methods typically ignore temporal dependencies between events, leading to event redundancy or inconsistency problems. To tackle the above two defects, in this paper we define event detection as a sequence generation task and propose a unified pre-training and fine-tuning framework that naturally enhances the inter-task association between event detection and captioning. Since the model predicts each event with previous events as context, the inter-dependency between events is fully exploited, and thus our model can detect more diverse and consistent events in the video. Experiments on the ActivityNet dataset show that our model outperforms state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data. Code is available at \url{https://github.com/QiQAng/UEDVC}.
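To make the sequence-generation formulation concrete, the following is a minimal, hypothetical sketch (in PyTorch; not the authors' implementation) of casting event detection as autoregressive generation of discretized boundary tokens, so that each new event is predicted with all previously generated events as context. The class name \texttt{VideoEventDecoder}, the token vocabulary layout, and all hyper-parameters are illustrative assumptions.

\begin{verbatim}
# A minimal sketch (assumption, not the paper's model): event boundaries are
# discretized into time-bin tokens and generated autoregressively, so each
# event is conditioned on all previously generated events.
import torch
import torch.nn as nn


class VideoEventDecoder(nn.Module):
    """Transformer decoder that emits time-bin tokens for event boundaries."""

    def __init__(self, feat_dim=512, num_time_bins=100, hidden=512, layers=2):
        super().__init__()
        # +2 special tokens: BOS (start decoding) and EOS (end of event list)
        self.vocab = num_time_bins + 2
        self.bos, self.eos = num_time_bins, num_time_bins + 1
        self.token_emb = nn.Embedding(self.vocab, hidden)
        self.video_proj = nn.Linear(feat_dim, hidden)
        layer = nn.TransformerDecoderLayer(hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.out = nn.Linear(hidden, self.vocab)

    def forward(self, video_feats, prev_tokens):
        # video_feats: (B, T, feat_dim); prev_tokens: (B, L) boundary tokens
        # generated so far, serving as context for the next prediction.
        memory = self.video_proj(video_feats)
        tgt = self.token_emb(prev_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(h)  # logits over time-bin / special tokens


if __name__ == "__main__":
    model = VideoEventDecoder()
    feats = torch.randn(1, 32, 512)        # 32 clip-level video features
    tokens = torch.tensor([[model.bos]])   # start decoding from BOS
    for _ in range(6):                     # greedily emit a few boundary tokens
        logits = model(feats, tokens)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, nxt], dim=1)
        if nxt.item() == model.eos:
            break
    print("generated boundary tokens:", tokens.tolist())
\end{verbatim}

In this sketch, consecutive pairs of generated time-bin tokens would be read off as (start, end) boundaries of detected events; because decoding is left-to-right, event redundancy and ordering inconsistencies can be penalized directly during training of the token sequence.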