Deep neural networks have achieved great success on the image captioning task. However, most existing models depend heavily on paired image-sentence datasets, which are very expensive to acquire. In this paper, we make the first attempt to train an image captioning model in an unsupervised manner. Instead of relying on manually labeled image-sentence pairs, our proposed model merely requires an image set, a sentence corpus, and an existing visual concept detector. The sentence corpus is used to teach the captioning model how to generate plausible sentences. Meanwhile, the knowledge in the visual concept detector is distilled into the captioning model to guide it to recognize the visual concepts in an image. To further encourage the generated captions to be semantically consistent with the image, the image and caption are projected into a common latent space so that they can be used to reconstruct each other. Given that existing sentence corpora are mainly designed for linguistic research and thus contain little reference to image content, we crawl a large-scale image description corpus of 2 million natural sentences to facilitate the unsupervised image captioning scenario. Experimental results show that our proposed model is able to produce quite promising results without using any labeled training pairs.
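The sketch below is a minimal, illustrative outline (not the authors' implementation) of how the three unsupervised training signals described above could be combined: an adversarial signal from the sentence corpus, a concept reward distilled from the visual concept detector, and a bidirectional reconstruction objective in the common latent space. All module names, dimensions, and loss weights are hypothetical placeholders.

```python
# Hypothetical sketch of the three unsupervised objectives; names, shapes,
# and weights are illustrative only and do not reflect the paper's code.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Projects CNN image features into the shared latent space."""
    def __init__(self, feat_dim=2048, latent_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, latent_dim)

    def forward(self, img_feats):
        return self.proj(img_feats)

class SentenceEncoder(nn.Module):
    """Encodes a generated caption into the same latent space."""
    def __init__(self, vocab_size=10000, emb_dim=512, latent_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, latent_dim, batch_first=True)

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))
        return h.squeeze(0)

def reconstruction_loss(img_latent, sent_latent):
    # Bidirectional reconstruction: an image and its generated caption
    # should map to nearby points in the common latent space.
    return nn.functional.mse_loss(sent_latent, img_latent)

def concept_reward(caption_tokens, detected_concept_ids):
    # Reward caption words matching concepts found by an off-the-shelf
    # visual concept detector (knowledge distillation signal).
    hits = [t for t in caption_tokens if t in detected_concept_ids]
    return len(hits) / max(len(caption_tokens), 1)

# Usage sketch: combine the three signals into one training objective.
img_enc, sent_enc = ImageEncoder(), SentenceEncoder()
img_feats = torch.randn(4, 2048)                  # batch of image features
captions = torch.randint(0, 10000, (4, 12))       # captions sampled by the generator
adv_loss = torch.tensor(0.5)                      # placeholder adversarial loss from a sentence discriminator
recon = reconstruction_loss(img_enc(img_feats), sent_enc(captions))
reward = sum(concept_reward(c.tolist(), {3, 17, 42}) for c in captions) / 4
total_loss = adv_loss + recon - 0.1 * reward      # loss weights are illustrative only
```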