It is encouraging to see the progress made in bridging videos and natural language. However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and tend to generate generic descriptions due to the insufficient training of visual words (e.g., nouns and verbs) and an inadequate decoding paradigm. In this paper, we propose a non-autoregressive decoding based model with a coarse-to-fine captioning procedure to alleviate these defects. In our implementation, we employ a bi-directional self-attention based network as our language model to achieve inference speedup, based on which we decompose the captioning procedure into two stages with different focuses. Specifically, given that visual words determine the semantic correctness of captions, we design a mechanism for generating visual words that not only promotes the training of scene-related words but also captures relevant details from videos to construct a coarse-grained sentence "template". Thereafter, we devise dedicated decoding algorithms that fill in the "template" with suitable words and modify inappropriate phrasing via iterative refinement to obtain a fine-grained description. Extensive experiments on two mainstream video captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach achieves state-of-the-art performance, generates diverse descriptions, and obtains high inference efficiency. Our code is available at https://github.com/yangbang18/Non-Autoregressive-Video-Captioning.
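To make the coarse-to-fine procedure concrete, below is a minimal sketch (not the authors' released code) of how such a decoding loop could look: stage 1 is assumed to have produced a "template" in which only visual words are fixed and all other slots are masked, and stage 2 repeatedly lets a bi-directional (non-causal) language model fill every slot in parallel and re-masks the least confident positions for iterative refinement. The toy vocabulary, module sizes, refinement schedule, and the `BidirectionalLM` interface are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch of coarse-to-fine non-autoregressive decoding.
# All hyperparameters and the toy vocabulary are assumptions for demonstration.
import torch
import torch.nn as nn

VOCAB = ["[PAD]", "[MASK]", "a", "man", "is", "playing", "guitar", "on", "stage"]
MASK_ID = 1


class BidirectionalLM(nn.Module):
    """Non-causal Transformer that scores every position in parallel."""

    def __init__(self, vocab_size, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (B, T) int64
        return self.out(self.encoder(self.embed(tokens)))  # logits: (B, T, V)


def coarse_to_fine_decode(model, visual_template, n_iters=3):
    """visual_template: (B, T) token ids; non-visual slots are already MASK."""
    tokens = visual_template.clone()
    for _ in range(n_iters):
        with torch.no_grad():
            probs = model(tokens).softmax(-1)       # (B, T, V)
        conf, pred = probs.max(-1)                  # per-slot confidence + argmax
        tokens = torch.where(tokens == MASK_ID, pred, tokens)
        # Re-mask the least confident third of positions and refine them again
        # (a simplified confidence-based schedule).
        k = max(1, tokens.size(1) // 3)
        low_conf = conf.topk(k, largest=False).indices
        tokens.scatter_(1, low_conf, MASK_ID)
    # Final parallel pass fills any remaining masked slots.
    with torch.no_grad():
        pred = model(tokens).argmax(-1)
    return torch.where(tokens == MASK_ID, pred, tokens)


if __name__ == "__main__":
    lm = BidirectionalLM(len(VOCAB))
    # Assumed output of the coarse stage: visual words fixed, the rest masked,
    # e.g. "[MASK] man [MASK] playing guitar [MASK] [MASK]".
    template = torch.tensor([[MASK_ID, 3, MASK_ID, 5, 6, MASK_ID, MASK_ID]])
    caption_ids = coarse_to_fine_decode(lm, template)
    print(" ".join(VOCAB[i] for i in caption_ids[0].tolist()))
```

With an untrained model this produces arbitrary tokens, but the structure shows why inference is fast: every position is predicted in parallel at each pass, so the number of model calls is fixed by the refinement schedule rather than by the caption length.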