Vulnerability to lexical perturbation is a critical weakness of automatic evaluation metrics for image captioning. This paper proposes Perturbation Robust Multi-Lingual CLIPScore (PR-MCS), a novel reference-free image captioning metric that is robust to such perturbations and applicable to multiple languages. To achieve perturbation robustness, we fine-tune the text encoder of CLIP with a language-agnostic method so that it distinguishes perturbed text from the original text. To verify the robustness of PR-MCS, we introduce a new fine-grained evaluation dataset consisting of detailed captions, critical objects, and the relationships between the objects for 3,000 images in five languages. In our experiments, PR-MCS significantly outperforms baseline metrics in capturing lexical noise across all perturbation types in all five languages, demonstrating that PR-MCS is highly robust to lexical perturbations.
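To make the setting concrete, the following minimal sketch shows what a reference-free, CLIPScore-style score and a lexical perturbation might look like. It rests on stated assumptions rather than on the PR-MCS implementation: the score follows the original CLIPScore form, w * max(cos(image, text), 0) with w = 2.5, the embeddings are plain lists of floats standing in for CLIP encoder outputs, and the helper names `clip_score` and `perturb`, along with the three perturbation kinds, are hypothetical and chosen only for illustration.

```python
import math
import random


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style reference-free score: w * max(cos(image, text), 0).

    The embeddings are assumed to come from CLIP's image and text encoders;
    here they are plain lists of floats so the sketch stays self-contained.
    """
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(
        sum(b * b for b in text_emb)
    )
    if norm == 0.0:
        return 0.0  # undefined cosine; treat as no similarity
    return w * max(dot / norm, 0.0)


def perturb(caption, kind, rng):
    """Apply one illustrative lexical perturbation to a whitespace-tokenized caption.

    The kinds below ("repeat", "remove", "swap") are assumptions for this
    sketch, not necessarily the perturbation types used in the paper.
    """
    tokens = caption.split()
    if len(tokens) < 2:
        return caption  # too short to perturb meaningfully
    i = rng.randrange(len(tokens))
    if kind == "repeat":
        tokens.insert(i, tokens[i])  # duplicate one token in place
    elif kind == "remove":
        del tokens[i]  # drop one token
    elif kind == "swap":
        j = (i + 1) % len(tokens)  # exchange a token with its neighbor
        tokens[i], tokens[j] = tokens[j], tokens[i]
    else:
        raise ValueError(f"unknown perturbation kind: {kind}")
    return " ".join(tokens)
```

A perturbation-robust metric should assign a perturbed caption a clearly lower score than the original caption for the same image. With a real encoder, one would embed both `caption` and `perturb(caption, "remove", random.Random(0))` and compare their `clip_score` values against the image embedding.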