Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models generalize poorly on this task. To address this, we propose a multi-task model that combines caption generation with image--sentence ranking and uses a decoding mechanism that re-ranks the generated captions according to their similarity to the image. This model generalizes substantially better to unseen combinations of concepts than state-of-the-art captioning models.
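The re-ranking step of the decoding mechanism can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `(caption, log_prob)` candidate format, and the toy word-overlap similarity are all hypothetical stand-ins for the learned image--sentence ranking model.

```python
# Hedged sketch of similarity-based re-ranking at decoding time.
# All names and the toy similarity function are illustrative assumptions,
# not the paper's actual model.

def rerank_captions(candidates, similarity_fn, image):
    """Re-rank decoder candidates by image-sentence similarity.

    candidates: list of (caption, log_prob) pairs, e.g. from beam search.
    similarity_fn: scores how well a caption matches the image (higher is better).
    Returns the captions sorted by similarity, best first.
    """
    scored = [(caption, similarity_fn(image, caption)) for caption, _ in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [caption for caption, _ in scored]

# Toy stand-in for a learned ranking model: word overlap with image "tags".
def toy_similarity(image_tags, caption):
    return len(set(caption.split()) & set(image_tags))

image_tags = {"dog", "frisbee", "grass"}
beam = [("a cat on a sofa", -1.2), ("a dog catching a frisbee", -1.5)]
print(rerank_captions(beam, toy_similarity, image_tags))
# The frisbee caption ranks first despite its lower decoder log-probability.
```

The point of the sketch is that the decoder's own probabilities are overridden by a cross-modal similarity score, which is what lets the model prefer captions containing unseen concept combinations that the language model alone would rank low.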