Discriminativeness is a desirable feature of image captions: captions should describe the characteristic details of input images. However, recent high-performing captioning models, which are trained with reinforcement learning (RL), tend to generate overly generic captions despite their high performance on various other criteria. First, we investigate the cause of this unexpectedly low discriminativeness and show that RL has a deeply rooted side effect of restricting the output vocabulary to high-frequency words. This limited vocabulary is a severe bottleneck for discriminativeness, as a model can hardly describe details beyond its vocabulary. Based on this identification of the bottleneck, we then drastically recast discriminative image captioning as the much simpler task of encouraging low-frequency word generation. Inspired by long-tail classification and debiasing methods, we propose methods that turn off-the-shelf RL models into discriminativeness-aware models with only a single epoch of fine-tuning on a small part of the parameters. Extensive experiments demonstrate that our methods significantly enhance the discriminativeness of off-the-shelf RL models and even outperform previous discriminativeness-aware methods at much smaller computational cost. Detailed analysis and human evaluation also verify that our methods boost discriminativeness without sacrificing the overall quality of captions.
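The abstract does not spell out the fine-tuning objective, but one standard long-tail technique in the family it alludes to is post-hoc logit adjustment, which subtracts a scaled log-prior of word frequencies from the decoder's vocabulary logits so that low-frequency words become more competitive at each decoding step. The sketch below is only an illustration of that generic technique, not the paper's method; the function name, `word_freqs`, and the temperature `tau` are assumptions.

```python
import torch


def adjust_logits(logits: torch.Tensor,
                  word_freqs: torch.Tensor,
                  tau: float = 1.0) -> torch.Tensor:
    """Illustrative post-hoc logit adjustment over a caption decoder's vocabulary.

    Subtracting tau * log-prior from the raw scores up-weights rare words,
    which is one simple way to encourage low-frequency word generation.

    logits:     (batch, vocab_size) raw decoder scores at one decoding step
    word_freqs: (vocab_size,) empirical word frequencies from training captions
    tau:        adjustment strength; tau = 0 recovers the original model
    """
    log_prior = torch.log(word_freqs.clamp_min(1e-12))  # avoid log(0)
    return logits - tau * log_prior
```

In this hypothetical setup, the adjusted logits would simply replace the raw logits inside an existing greedy or beam-search decoding loop, leaving the rest of the RL-trained model untouched.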