In this paper, we propose QACE, a new metric based on Question Answering for Caption Evaluation. QACE generates questions about the evaluated caption and checks its content by asking these questions against either the reference caption or the source image. We first develop QACE-Ref, which compares the answers obtained from the evaluated caption to those from its reference, and report results competitive with state-of-the-art metrics. To go further, we propose QACE-Img, which asks the questions directly on the image instead of the reference. QACE-Img requires a Visual-QA system; unfortunately, standard VQA models are framed as classification over only a few thousand answer categories. Instead, we propose Visual-T5, an abstractive VQA system. The resulting metric, QACE-Img, is multi-modal, reference-less, and explainable. Our experiments show that QACE-Img compares favorably with other reference-less metrics. We will release the pre-trained models needed to compute QACE.