A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper, we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgments of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform best. Such metrics also demonstrate strong generalization capability across tasks. For ease of use, we make our metrics available as a web service.
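To make the core idea concrete, the sketch below illustrates (under simplifying assumptions, not the paper's exact implementation) how a system output and a reference can be compared by combining contextualized token embeddings with a transport-style distance. The choice of `bert-base-uncased` and the greedy "relaxed" mover's matching are assumptions for illustration; the actual MoverScore additionally applies IDF weighting and solves the exact optimal-transport problem.

```python
# Minimal illustrative sketch: contextualized embeddings + a mover's-style
# similarity between a hypothesis and a reference. Not the authors' code.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> np.ndarray:
    """Return one unit-normalized contextualized vector per word-piece token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, dim)
    vecs = hidden[1:-1].numpy()                                 # drop [CLS], [SEP]
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def relaxed_mover_score(hypothesis: str, reference: str) -> float:
    """Symmetric relaxed mover's similarity: each token is greedily matched to
    its closest token on the other side (an approximation of exact transport)."""
    h, r = embed(hypothesis), embed(reference)
    sims = h @ r.T                        # cosine similarities, shape (|h|, |r|)
    recall = sims.max(axis=0).mean()      # best hypothesis match per reference token
    precision = sims.max(axis=1).mean()   # best reference match per hypothesis token
    return float((precision + recall) / 2)

print(relaxed_mover_score("The cat sat on the mat.",
                          "A cat was sitting on the mat."))
```

A higher score indicates that the two texts are closer in contextualized embedding space, which is the sense in which such metrics compare semantics rather than surface forms.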