In this work, we focus on improving the captions generated by image-caption generation systems. We propose a novel re-ranking approach that leverages visual-semantic measures to identify the caption that best captures the visual information in the image. Our re-ranker uses the Belief Revision framework (Blok et al., 2003) to calibrate the original likelihood of the top-n captions by explicitly exploiting the semantic relatedness between each candidate caption and the visual context of the image. Our experiments demonstrate the utility of our approach: the re-ranker improves the performance of a typical image-captioning system without requiring any additional training or fine-tuning.
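For concreteness, one way the calibration could be instantiated follows the similarity-based conditional of Blok et al. (2003). Writing P(c) for a candidate caption's original decoder likelihood, P(v) for the prior of the visual context (e.g., the confidence of a visual classifier), and sim(c, v) for the semantic relatedness between the caption and the visual context, the revised score could take the form below; this assignment of terms is an illustrative sketch, not necessarily the exact instantiation used in this work.

\[
P(c \mid v) \;=\; P(c)^{\alpha},
\qquad
\alpha \;=\; \left[\frac{1 - \mathrm{sim}(c, v)}{1 + \mathrm{sim}(c, v)}\right]^{\,1 - P(v)}
\]

Under this form, a caption that is semantically close to the visual context yields an exponent near zero, pushing the revised score toward 1 and boosting that caption in the re-ranked list, while an unrelated caption keeps a score close to its original likelihood.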