Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges in this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in previous works, resulting in identical captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to dynamically balance the contributions from the current event and its surrounding contexts. We empirically show that our attentively fused event representation is superior to the proposal hidden states or video contents alone. By coupling the proposal and captioning modules into one unified framework, our model outperforms the state-of-the-art methods on the ActivityNet Captions dataset with a relative gain of over 100% (METEOR score increases from 4.82 to 9.65).
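To make the fusion idea concrete, the following is a minimal sketch of how a context gate over an event representation and its surrounding context could look; the module name ContextGate, the dimensions, and the specific gating form (a sigmoid gate computed from the concatenated event and context vectors) are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Illustrative context-gated fusion of an event hidden state and an
    attended context vector (hypothetical layer names and gating form)."""

    def __init__(self, event_dim: int, ctx_dim: int):
        super().__init__()
        # Gate is computed from the concatenated event and context features.
        self.gate = nn.Linear(event_dim + ctx_dim, ctx_dim)

    def forward(self, h_event: torch.Tensor, h_ctx: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) controls how much surrounding context is let through.
        g = torch.sigmoid(self.gate(torch.cat([h_event, h_ctx], dim=-1)))
        # Fused representation: current event plus gated context.
        return torch.cat([h_event, g * h_ctx], dim=-1)

# Usage: fuse a 512-d event hidden state with a 512-d attended context vector.
fuse = ContextGate(512, 512)
fused = fuse(torch.randn(8, 512), torch.randn(8, 512))  # shape: (8, 1024)
```

The dynamic gate lets two events that end at (nearly) the same time receive different fused representations, since each event's own hidden state modulates how much of the shared surrounding context is mixed in.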