Although image captioning models can generate impressive descriptions for a given image, two challenges remain: (1) the controllability and diversity of existing models are still far from satisfactory; and (2) models sometimes produce extremely poor-quality captions. In this paper, we introduce two novel methods to address these problems respectively. For the former problem, we introduce a control signal that governs macroscopic sentence attributes, such as sentence quality, sentence length, sentence tense, and number of nouns. With such a control signal, the controllability and diversity of existing captioning models are enhanced. For the latter problem, we propose a strategy in which an image-text matching model is trained to measure the quality of the sentences generated in the forward and backward directions and select the better one. This strategy effectively reduces the proportion of poor-quality sentences. Our proposed methods can be easily applied to most image captioning models to improve their overall performance. Built on the Up-Down model, our methods achieve BLEU-4/CIDEr/SPICE scores of 37.5/120.3/21.5 on the MSCOCO Karpathy test split with cross-entropy training, surpassing other state-of-the-art methods trained with cross-entropy loss.
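The bidirectional selection strategy can be sketched as follows. This is a minimal illustration, not the paper's implementation: `match_score` is a hypothetical stand-in for the trained image-text matching model, and the caption strings would in practice come from forward and backward decoding of the same captioner.

```python
def select_caption(image, forward_caption, backward_caption, match_score):
    """Return whichever of the two candidate captions better matches the image.

    `match_score(image, caption)` is assumed to return a higher value for
    better image-text agreement, standing in for the trained matching model.
    """
    fwd = match_score(image, forward_caption)
    bwd = match_score(image, backward_caption)
    # Keep the forward caption on ties; either choice is reasonable.
    return forward_caption if fwd >= bwd else backward_caption
```

Because the selector only compares two scores per image, it can wrap any existing captioner without retraining it, which is why the strategy is model-agnostic.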