Video captioning aims to generate natural language descriptions of video content, in which representation learning plays a crucial role. Existing methods are mainly developed within the supervised learning framework via word-by-word comparison of the generated caption against the ground-truth text, without fully exploiting linguistic semantics. In this work, we propose a hierarchical modular network that bridges video representations and linguistic semantics at three levels before generating captions. In particular, the hierarchy is composed of: (I) the entity level, which highlights objects that are most likely to be mentioned in captions; (II) the predicate level, which learns actions conditioned on the highlighted objects and is supervised by the predicate in captions; (III) the sentence level, which learns the global semantic representation and is supervised by the whole caption. Each level is implemented by one module. Extensive experimental results show that the proposed method performs favorably against state-of-the-art models on two widely-used benchmarks, achieving CIDEr scores of 104.0% on MSVD and 51.5% on MSR-VTT.
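For concreteness, the sketch below shows one way the three-level hierarchy could be wired up in PyTorch. All names and choices here (EntityModule, PredicateModule, SentenceModule, the feature dimensions d_obj/d_motion/d_video/d_model, and the simple attention-pooling and concatenation fusion) are illustrative assumptions, not the paper's actual implementation; the per-level linguistic supervision losses and the caption decoder are omitted.

```python
import torch
import torch.nn as nn


class EntityModule(nn.Module):
    """Entity level: scores detected-object features to highlight the
    objects most likely to be mentioned in the caption."""

    def __init__(self, d_obj: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_obj, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, d_obj)
        h = torch.tanh(self.proj(obj_feats))
        w = torch.softmax(self.score(h).squeeze(-1), dim=-1)  # (batch, num_objects)
        return torch.einsum("bn,bnd->bd", w, h)  # attention-pooled entity feature


class PredicateModule(nn.Module):
    """Predicate level: learns an action representation conditioned on the
    highlighted objects (supervised by the caption's predicate in training)."""

    def __init__(self, d_motion: int, d_model: int):
        super().__init__()
        self.fuse = nn.Linear(d_motion + d_model, d_model)

    def forward(self, motion_feats: torch.Tensor, entity_repr: torch.Tensor) -> torch.Tensor:
        # motion_feats: (batch, d_motion); entity_repr: (batch, d_model)
        return torch.tanh(self.fuse(torch.cat([motion_feats, entity_repr], dim=-1)))


class SentenceModule(nn.Module):
    """Sentence level: learns a global semantic representation of the video
    (supervised by an embedding of the whole caption in training)."""

    def __init__(self, d_video: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_video + d_model, d_model)

    def forward(self, video_feats: torch.Tensor, predicate_repr: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, d_video); predicate_repr: (batch, d_model)
        return torch.tanh(self.proj(torch.cat([video_feats, predicate_repr], dim=-1)))


class HierarchicalModularNetwork(nn.Module):
    """Chains the three modules so each level conditions the next; the three
    outputs would jointly condition a caption decoder."""

    def __init__(self, d_obj: int = 2048, d_motion: int = 1024,
                 d_video: int = 1536, d_model: int = 512):
        super().__init__()
        self.entity = EntityModule(d_obj, d_model)
        self.predicate = PredicateModule(d_motion, d_model)
        self.sentence = SentenceModule(d_video, d_model)

    def forward(self, obj_feats, motion_feats, video_feats):
        e = self.entity(obj_feats)            # entity-level representation
        p = self.predicate(motion_feats, e)   # predicate-level representation
        s = self.sentence(video_feats, p)     # sentence-level representation
        return e, p, s


if __name__ == "__main__":
    hmn = HierarchicalModularNetwork()
    e, p, s = hmn(torch.randn(2, 10, 2048),  # 10 object proposals per video
                  torch.randn(2, 1024),      # clip-level motion feature
                  torch.randn(2, 1536))      # global appearance feature
    print(e.shape, p.shape, s.shape)         # each: torch.Size([2, 512])
```

The structural point the sketch preserves is the bottom-up conditioning described in the abstract: the entity representation conditions the predicate representation, which in turn conditions the sentence-level representation.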