Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g., sequence-to-sequence models) has shown promising results in abstracting a coarse description of a short video, it remains very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the challenge by proposing a novel hierarchical reinforcement learning framework for video captioning, where a high-level Manager module learns to design sub-goals and a low-level Worker module selects primitive actions to fulfill those sub-goals. With this compositional framework reinforcing video captioning at different levels, our approach significantly outperforms all baseline methods on a newly introduced large-scale dataset for fine-grained video captioning. Furthermore, our non-ensemble model already achieves state-of-the-art results on the widely used MSR-VTT dataset.
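The Manager/Worker decomposition described above can be sketched as follows. This is a minimal toy illustration, not the paper's actual model: the learned recurrent networks are replaced by random linear policies, and all class and function names (`Manager`, `Worker`, `set_goal`, `act`, `generate`) are illustrative assumptions. It shows only the control flow — the Manager periodically emits a sub-goal vector from the video context, and the Worker generates primitive actions (word ids) conditioned on that goal.

```python
import numpy as np

rng = np.random.default_rng(0)

class Manager:
    """High-level policy: emits a goal vector for the next caption segment.
    A toy linear stand-in for the paper's learned Manager network."""
    def __init__(self, ctx_dim, goal_dim):
        self.W = rng.standard_normal((goal_dim, ctx_dim)) * 0.1

    def set_goal(self, context):
        return np.tanh(self.W @ context)

class Worker:
    """Low-level policy: picks primitive actions (word ids) to fulfill a goal.
    A toy softmax policy over a fixed vocabulary."""
    def __init__(self, goal_dim, vocab_size):
        self.W = rng.standard_normal((vocab_size, goal_dim)) * 0.1

    def act(self, goal):
        logits = self.W @ goal
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

def generate(manager, worker, video_context, segments=3, words_per_segment=4):
    """Hierarchical rollout: for each segment, the Manager designs a sub-goal
    and the Worker emits words conditioned on it."""
    caption = []
    for _ in range(segments):
        goal = manager.set_goal(video_context)
        for _ in range(words_per_segment):
            caption.append(worker.act(goal))
    return caption

ctx = rng.standard_normal(16)
caption = generate(Manager(16, 8), Worker(8, 100), ctx)
print(len(caption))  # 3 segments x 4 words = 12 word ids
```

In the actual framework both modules would be trained with reinforcement learning signals at their respective levels (segment-level reward for the Manager, word-level reward for the Worker); the fixed segment length here is purely a simplification.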