Image captioning research has focused on generalizing to images drawn from the same distribution as the training set, rather than on the more challenging problem of generalizing to different distributions of images. Recently, Nikolaus et al. (2019) introduced a dataset to assess compositional generalization in image captioning, where models are evaluated on their ability to describe images with unseen adjective-noun and noun-verb compositions. In this work, we investigate different methods to improve compositional generalization by planning the syntactic structure of a caption. Our experiments show that jointly modeling tokens and syntactic tags enhances generalization in both RNN- and Transformer-based models, while also improving performance on standard metrics.
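To illustrate what "jointly modeling tokens and syntactic tags" can look like in practice, the following is a minimal sketch, not the authors' implementation: an RNN decoder conditioned on image features that predicts, at each step, both the next word and its syntactic tag, trained with the sum of the two cross-entropy losses. All dimensions, class names, and the way image features initialize the decoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointTokenTagDecoder(nn.Module):
    """Hypothetical caption decoder with two output heads: one over the
    word vocabulary and one over a set of syntactic tags (e.g. POS tags)."""

    def __init__(self, vocab_size, tag_size, img_dim=2048, emb_dim=256, hid_dim=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.tag_emb = nn.Embedding(tag_size, emb_dim)
        # Assumption: image features initialize the LSTM hidden state.
        self.img_proj = nn.Linear(img_dim, hid_dim)
        self.rnn = nn.LSTM(2 * emb_dim, hid_dim, batch_first=True)
        self.tok_head = nn.Linear(hid_dim, vocab_size)  # next-word logits
        self.tag_head = nn.Linear(hid_dim, tag_size)    # syntactic-tag logits

    def forward(self, img_feats, tokens, tags):
        # tokens, tags: (batch, seq_len) previous words and their tags.
        x = torch.cat([self.tok_emb(tokens), self.tag_emb(tags)], dim=-1)
        h0 = torch.tanh(self.img_proj(img_feats)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out, _ = self.rnn(x, (h0, c0))
        return self.tok_head(out), self.tag_head(out)

def joint_loss(tok_logits, tag_logits, tok_targets, tag_targets, pad_id=0):
    """Joint objective: sum of token and tag cross-entropies over the caption."""
    ce = nn.CrossEntropyLoss(ignore_index=pad_id)
    return (ce(tok_logits.flatten(0, 1), tok_targets.flatten())
            + ce(tag_logits.flatten(0, 1), tag_targets.flatten()))
```

Under this sketch, the tag head acts as the syntactic plan: supervising the decoder with tag targets encourages it to commit to a syntactic slot (e.g. adjective before noun) before choosing the word that fills it, which is one plausible mechanism for the compositional gains described above.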