The open-ended nature of visual captioning makes it a challenging area for evaluation. The majority of proposed models rely on specialized training to improve human correlation, resulting in limited adoption, generalizability, and explainability. We introduce "typicality", a new formulation of evaluation rooted in information theory, which is uniquely suited for problems lacking a definite ground truth. Typicality serves as our framework to develop a novel semantic comparison metric, SPARCS, as well as referenceless fluency evaluation metrics. Over the course of our analysis, two separate dimensions of fluency naturally emerge: style, captured by the metric SPURTS, and grammar, captured in the form of grammatical outlier penalties. Through extensive experiments and ablation studies on benchmark datasets, we show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences. Our proposed metrics, along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.