Recently, image captioning has attracted considerable interest in both academia and industry. Most existing systems are built upon large-scale datasets of image-sentence pairs, which are, however, time-consuming to construct. Moreover, even the most advanced image captioning systems still struggle to achieve deep image understanding. In this work, we achieve unpaired image captioning by bridging the vision and language domains with high-level semantic information. The motivation stems from the fact that semantic concepts of the same modality can be extracted from both images and descriptions. To further improve the quality of the generated captions, we propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image. Extensive experiments on the MSCOCO dataset show that we can generate desirable captions without paired datasets. Furthermore, the proposed approach boosts five strong baselines under the paired setting, with the most significant improvement in CIDEr score reaching 8%, demonstrating that it is effective and generalizes well to a wide range of models.