Dense Video Captioning (DVC) aims to generate captions with timestamps for the multiple events in a video. Semantic information plays an important role in both the localization and the description subtasks of DVC. We present a semantic-assisted dense video captioning model built on an encoder-decoder framework. In the encoding stage, we design a concept detector to extract semantic information, which is then fused with multi-modal visual features to fully represent the input video. In the decoding stage, we design a classification head, in parallel with the localization and captioning heads, to provide semantic supervision. Our method achieves significant improvements on the YouMakeup dataset under the DVC evaluation metrics and performs strongly in the Makeup Dense Video Captioning (MDVC) task of the PIC 4th Challenge.
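The encoder-decoder pipeline described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual architecture: all dimensions, the sigmoid concept detector, the concatenation-based fusion, and the three linear heads are assumptions chosen only to show the data flow (semantic features fused into the encoding, then parallel localization, captioning, and classification heads).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): T frames, visual feature dim,
# concept vocabulary size, shared encoder dim, toy caption vocabulary size.
T, D_VIS, N_CONCEPTS, D_MODEL, V_CAP = 8, 16, 10, 32, 100

def concept_detector(visual, weights):
    # Per-frame concept probabilities via a linear layer + sigmoid (assumed form)
    return 1.0 / (1.0 + np.exp(-(visual @ weights)))

visual = rng.normal(size=(T, D_VIS))            # stand-in multi-modal visual features
W_concept = rng.normal(size=(D_VIS, N_CONCEPTS))
concepts = concept_detector(visual, W_concept)  # semantic information, in [0, 1]

# Encoding stage: fuse semantic information with visual features (concatenation here)
fused = np.concatenate([visual, concepts], axis=-1)  # shape (T, D_VIS + N_CONCEPTS)
W_enc = rng.normal(size=(D_VIS + N_CONCEPTS, D_MODEL))
enc = np.tanh(fused @ W_enc)                         # shared video encoding

# Decoding stage: three parallel heads on the shared encoding
W_loc = rng.normal(size=(D_MODEL, 2))           # localization: start/end offsets per frame
W_cap = rng.normal(size=(D_MODEL, V_CAP))       # captioning: token logits per frame
W_cls = rng.normal(size=(D_MODEL, N_CONCEPTS))  # classification: semantic-supervision logits

loc = enc @ W_loc    # (T, 2)
cap = enc @ W_cap    # (T, V_CAP)
cls = enc @ W_cls    # (T, N_CONCEPTS)
```

In this sketch the classification head predicts the same concept vocabulary the detector produces, so its loss can supervise the semantic representation alongside the localization and captioning losses.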