Image captioning is one of the tasks that can directly take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model. However, since web-crawled data contains image-text pairs that are aligned at different levels, the inherent noise (e.g., misaligned pairs) makes it difficult to learn a precise captioning model. While a filtering strategy can effectively remove noisy data, it leads to a decrease in learnable knowledge and sometimes brings about a new problem of data deficiency. To take the best of both worlds, we propose a noise-aware learning framework, which learns rich knowledge from the entire web-crawled data while being less affected by the noise. This is achieved by the proposed quality-controllable model, which is trained using the alignment levels of the image-text pairs as an additional control signal. The alignment-conditioned training allows the model to generate high-quality, well-aligned captions by simply setting the control signal to the desired alignment level at inference time. Through in-depth analysis, we show that our controllable captioning model is effective in handling noise. In addition, with two tasks of zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we also demonstrate that our model can produce high-quality captions in terms of descriptiveness and distinctiveness. Code is available at \url{https://github.com/kakaobrain/noc}.
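To make the alignment-conditioned training idea concrete, the following is a minimal sketch, not the released implementation at the repository above: the control-token names, the similarity bucketing, and the helper functions (`alignment_bucket`, `build_training_text`, `build_inference_prompt`) are illustrative assumptions about how an alignment level could be attached to each web caption during training and requested at inference time.

```python
# Minimal illustrative sketch (not the authors' code) of conditioning a captioning
# model on an alignment-level control signal derived from image-text similarity.

def alignment_bucket(similarity: float, num_buckets: int = 5) -> int:
    """Map an image-text similarity score in [0, 1] to a discrete alignment level."""
    level = int(similarity * num_buckets)
    return min(max(level, 0), num_buckets - 1)

def build_training_text(caption: str, similarity: float) -> str:
    """Prefix a noisy web caption with its alignment-level control token for training."""
    return f"<align_{alignment_bucket(similarity)}> {caption}"

def build_inference_prompt(desired_level: int = 4) -> str:
    """At inference, request the highest alignment level to obtain well-aligned captions."""
    return f"<align_{desired_level}>"

if __name__ == "__main__":
    # Similarity scores below are made up for illustration only.
    print(build_training_text("a dog playing in the park", similarity=0.92))  # well-aligned pair
    print(build_training_text("click here for best deals", similarity=0.11))  # misaligned pair
    print(build_inference_prompt())  # prompt for generating a high-quality caption
```

Under this sketch, the decoder sees captions of every quality level during training but learns to associate each level with its control token, so conditioning on the highest-level token at test time steers generation toward well-aligned captions.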