The goal of unpaired image captioning (UIC) is to describe images without using image-caption pairs during training. Although challenging, we expect the task can be accomplished by leveraging a training set of images aligned with visual concepts. Most existing studies use off-the-shelf algorithms to obtain the visual concepts, because the bounding-box (BBox) labels or relationship-triplet labels required for training are expensive to acquire. To eliminate this expensive annotation requirement, we propose a novel approach for cost-effective UIC. Specifically, we adopt image-level labels to optimize the UIC model in a weakly-supervised manner. For each image, we assume that only image-level labels are available, without specific locations or counts. The image-level labels are used to train a weakly-supervised object recognition model that extracts object information (e.g., instances) from an image, and the extracted instances are then used to infer relationships among the objects with an enhanced graph neural network (GNN). The proposed approach achieves comparable or even better performance than previous methods, without the expensive annotation cost. Furthermore, we design an unrecognized object (UnO) loss combined with a visual concept reward to better align the inferred object and relationship information with the images. This effectively alleviates the tendency of existing UIC models to generate sentences containing nonexistent objects. To the best of our knowledge, this is the first attempt to solve weakly-supervised visual concept recognition for UIC (WS-UIC) based only on image-level labels. Extensive experiments demonstrate that the proposed WS-UIC model achieves promising results on the COCO dataset while significantly reducing the cost of labeling.
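To make the UnO loss and visual concept reward concrete, below is a minimal sketch of one plausible formulation. The abstract does not specify the exact losses, so the function names (uno_loss, concept_reward), the set-membership scoring, and the combining weight are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch: penalize caption words naming objects that the
# weakly-supervised recognizer did not find in the image (UnO penalty),
# and reward captions that mention the recognized visual concepts.
from typing import List, Set


def concept_reward(caption_tokens: List[str], recognized: Set[str]) -> float:
    """Reward: fraction of recognized visual concepts mentioned in the caption."""
    if not recognized:
        return 0.0
    mentioned = sum(1 for c in recognized if c in caption_tokens)
    return mentioned / len(recognized)


def uno_loss(caption_tokens: List[str], recognized: Set[str],
             object_vocab: Set[str]) -> float:
    """Penalty: fraction of object words in the caption that were NOT
    recognized in the image (i.e., likely hallucinated objects)."""
    object_words = [t for t in caption_tokens if t in object_vocab]
    if not object_words:
        return 0.0
    unrecognized = sum(1 for t in object_words if t not in recognized)
    return unrecognized / len(object_words)


if __name__ == "__main__":
    vocab = {"dog", "frisbee", "car", "cat"}    # hypothetical object vocabulary
    recognized = {"dog", "frisbee"}             # output of the weakly-supervised recognizer
    caption = "a dog catching a frisbee near a car".split()
    # A training objective could maximize reward - lam * penalty for some weight lam.
    print(concept_reward(caption, recognized))          # 1.0 (both concepts mentioned)
    print(uno_loss(caption, recognized, vocab))         # ~0.33 ("car" was not recognized)
```

Under this reading, the reward pulls the caption toward the concepts the recognizer did detect, while the UnO penalty suppresses object words with no visual support, which is the hallucination issue the abstract describes.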