Image captioning is a longstanding problem at the intersection of computer vision and natural language processing. In the deep learning era, researchers have achieved impressive state-of-the-art performance. Most of these state-of-the-art methods, however, require a large volume of annotated image-caption pairs to train their models. Given an image dataset of interest, a practitioner needs to annotate a caption for each image in the training set, and this process must be repeated for every newly collected dataset. In this paper, we explore the task of unsupervised image captioning, which uses unpaired images and texts to train the model, so that the texts can come from sources different from the images. A main line of research on this topic that has been shown to be effective constructs pairs from the images and texts in the training set according to their overlap in objects. Unlike in the supervised setting, however, these constructed pairings are not guaranteed to have fully overlapping sets of objects. Our work overcomes this by harvesting the objects corresponding to a given sentence from the training set, even if they do not belong to the same image. When used as input to a transformer, such a mixture of objects enables larger, if not full, object coverage, and when supervised by the corresponding sentence, it produces results that outperform the current state-of-the-art unsupervised methods by a significant margin. Building on this finding, we further show that (1) additional information on relationships between objects and attributes of objects also helps boost performance; and (2) our method extends well to non-English image captioning, which usually suffers from a scarcer level of annotations. Our findings are supported by strong empirical results. Our code is available at https://github.com/zihangm/obj-centric-unsup-caption.
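To make the core idea concrete, the sketch below illustrates one way the object-harvesting step and the sentence-supervised transformer could fit together. It is a minimal, hypothetical Python/PyTorch example, not the authors' released implementation: the function and class names (`harvest_objects`, `ObjectCentricCaptioner`), the 2048-d detection features, the word-level matching of object labels to sentence tokens, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the object-harvesting idea described in the abstract.
# All names, dimensions, and the toy detection pool below are assumptions,
# not taken from the paper's released code.
from typing import Dict

import torch
import torch.nn as nn


def harvest_objects(sentence: str,
                    detection_pool: Dict[str, torch.Tensor]) -> torch.Tensor:
    """Collect region features for every detected object class mentioned in the
    sentence, regardless of which training image each detection came from."""
    words = set(sentence.lower().split())
    matched = [feat for label, feat in detection_pool.items() if label in words]
    if not matched:                        # fall back to a zero vector if nothing matches
        return torch.zeros(1, 2048)
    return torch.stack(matched)            # (num_matched_objects, feat_dim)


class ObjectCentricCaptioner(nn.Module):
    """Transformer decoder conditioned on harvested object features and
    supervised by the (unpaired) sentence that selected those objects."""

    def __init__(self, vocab_size: int, feat_dim: int = 2048, d_model: int = 256):
        super().__init__()
        self.obj_proj = nn.Linear(feat_dim, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, obj_feats: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
        memory = self.obj_proj(obj_feats)   # (B, num_objs, d_model)
        tgt = self.tok_emb(caption_ids)     # (B, seq_len, d_model)
        hidden = self.decoder(tgt, memory)
        return self.lm_head(hidden)         # (B, seq_len, vocab_size)


# Toy usage: a pool of per-class detection features gathered across the whole
# training set (Faster R-CNN-style 2048-d features are assumed here).
pool = {"dog": torch.randn(2048), "frisbee": torch.randn(2048), "car": torch.randn(2048)}
objs = harvest_objects("a dog catches a frisbee", pool).unsqueeze(0)  # (1, 2, 2048)
caption_ids = torch.randint(0, 1000, (1, 8))                          # placeholder token ids
model = ObjectCentricCaptioner(vocab_size=1000)
logits = model(objs, caption_ids)                                     # (1, 8, 1000)
```

Because the harvested objects are chosen to match the sentence, the decoder is trained on inputs whose object coverage is (by construction) close to complete, which is the property the abstract argues distinguishes this setup from pairings constructed by object overlap alone.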