Image captioning models require high-level generalization ability to describe the contents of diverse images in words. Most existing approaches treat image-caption pairs uniformly during training, without considering differences in their learning difficulty. Several image captioning approaches introduce curriculum learning, which presents training data in order of increasing difficulty. However, their difficulty measurements are based either on domain-specific features or on prior model training. In this paper, we propose a simple yet efficient difficulty measurement for image captioning based on cross-modal similarity computed by a pretrained vision-language model. Experiments on the COCO and Flickr30k datasets show that our approach achieves superior performance and competitive convergence speed compared with baselines, without requiring heuristics or incurring additional training costs. Moreover, the improved performance on difficult examples and unseen data further demonstrates its generalization ability.
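Below is a minimal sketch of how such a cross-modal difficulty score could be computed with an off-the-shelf pretrained vision-language model (CLIP via the Hugging Face `transformers` library). The checkpoint name and the mapping from similarity to difficulty (lower image-caption similarity treated as harder) are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice; the paper's exact vision-language model may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def difficulty(image: Image.Image, caption: str) -> float:
    """Score an image-caption pair: lower cross-modal similarity means higher difficulty.

    The (1 - cosine_similarity) mapping is an assumption for illustration.
    """
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    # image_embeds and text_embeds are L2-normalized projections, so their dot
    # product is the cosine similarity between the image and the caption.
    sim = (out.image_embeds @ out.text_embeds.T).item()
    return 1.0 - sim
```

A curriculum could then be formed by sorting training pairs by this score, e.g. `sorted(dataset, key=lambda pair: difficulty(*pair))`, and scheduling easier pairs first; since the vision-language model is used only for inference, obtaining the ordering requires no extra model training, consistent with the abstract's claim of avoiding additional training costs.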