Video captioning aims to automatically generate natural language sentences that describe the visual content of a given video. Existing generative models such as encoder-decoder frameworks cannot explicitly exploit object-level interactions and frame-level information in complex spatio-temporal data to generate semantically rich captions. Our main contribution is to identify three key problems in a joint framework for future video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional Graph that can fuse spatio-temporal information into latent object proposals. 2) Visual Knowledge: Latent Proposal Aggregation is proposed to dynamically extract visual words at higher semantic levels. 3) Sentence Validation: a novel Discriminative Language Validator is proposed to verify generated captions so that key semantic concepts are effectively preserved. Our experiments on two public datasets (MSVD and MSR-VTT) demonstrate significant improvements over state-of-the-art approaches on all metrics, especially BLEU-4 and CIDEr. Our code is available at https://github.com/baiyang4/D-LSG-Video-Caption.
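To make the three components concrete, below is a minimal PyTorch sketch of how such a pipeline could be wired together. It is not the authors' implementation: the module names mirror the abstract, but all dimensions, the attention-based fusion standing in for the Conditional Graph's message passing, the learned-query pooling for Latent Proposal Aggregation, and the GRU-based validator are illustrative assumptions.

```python
# Illustrative sketch only (not the official D-LSG code). Dimensions and fusion
# choices are assumptions made for readability.
import torch
import torch.nn as nn


class ConditionalGraph(nn.Module):
    """Fuses frame-level context into object features to form latent proposals (assumed design)."""

    def __init__(self, obj_dim=1024, frame_dim=2048, hidden=512):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hidden)
        self.frame_proj = nn.Linear(frame_dim, hidden)
        # Cross-attention stands in for graph message passing conditioned on frames.
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, obj_feats, frame_feats):
        # obj_feats:   (B, N_obj, obj_dim)  detected object features
        # frame_feats: (B, T, frame_dim)    frame-level appearance/motion features
        q = self.obj_proj(obj_feats)
        kv = self.frame_proj(frame_feats)
        fused, _ = self.attn(q, kv, kv)   # objects attend to frames
        return q + fused                  # latent object proposals: (B, N_obj, hidden)


class LatentProposalAggregation(nn.Module):
    """Pools latent proposals into a small set of higher-level 'visual words' (assumed design)."""

    def __init__(self, hidden=512, num_words=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_words, hidden))
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, proposals):
        B = proposals.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        words, _ = self.attn(q, proposals, proposals)
        return words                      # visual words: (B, num_words, hidden)


class DiscriminativeLanguageValidator(nn.Module):
    """Scores a caption against the visual words to check semantic consistency (assumed design)."""

    def __init__(self, vocab_size=10000, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, caption_tokens, visual_words):
        # caption_tokens: (B, L) token ids of a generated or ground-truth caption
        _, h = self.rnn(self.embed(caption_tokens))  # final hidden state: (1, B, hidden)
        sent = h.squeeze(0)                          # sentence embedding: (B, hidden)
        vis = visual_words.mean(dim=1)               # pooled visual words: (B, hidden)
        return torch.sigmoid(self.score(torch.cat([sent, vis], dim=-1)))  # validity score
```

In this sketch the validator plays the discriminator role described in the abstract: during training it would be given ground-truth captions as positives and generated captions as negatives, encouraging the generator to keep the key semantic concepts carried by the visual words.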