A new metric, \texttt{BaryScore}, for evaluating text generation based on deep contextualized embeddings (\textit{e.g.}, BERT, RoBERTa, ELMo) is introduced. The metric is motivated by a new framework relying on optimal transport tools, \textit{i.e.}, the Wasserstein distance and barycenter. By modelling the layer outputs of deep contextualized embeddings as probability distributions rather than as vector embeddings, this framework provides a natural way to aggregate the different outputs through the Wasserstein space topology. In addition, it provides theoretical grounds for our metric and offers an alternative to available solutions (\textit{e.g.}, MoverScore and BertScore). Numerical evaluation is performed on four different tasks: machine translation, summarization, data2text generation, and image captioning. Our results show that \texttt{BaryScore} outperforms other BERT-based metrics and exhibits more consistent behaviour, in particular for text summarization.
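
For readers unfamiliar with the two optimal transport tools named above, their standard textbook definitions for discrete measures can be sketched as follows. The notation ($\mu$, $\nu$, $a$, $b$, $P$, $\lambda_\ell$, $L$) is introduced here for illustration only and is not taken from the text; the exact cost, weights, and exponent used by \texttt{BaryScore} are not specified in this abstract. Let $\mu = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} b_j \delta_{y_j}$ be two discrete probability measures supported on embedding vectors $x_i$ and $y_j$, with nonnegative weights $a_i$ and $b_j$ that each sum to one. The $p$-Wasserstein distance between them is given by
\begin{equation}
W_p^p(\mu, \nu) = \min_{P \in U(a,b)} \sum_{i=1}^{n} \sum_{j=1}^{m} P_{ij} \, \| x_i - y_j \|^p ,
\end{equation}
where the set of admissible transport plans is
\begin{equation}
U(a,b) = \Big\{ P : P_{ij} \geq 0, \ \sum_{j=1}^{m} P_{ij} = a_i, \ \sum_{i=1}^{n} P_{ij} = b_j \Big\} .
\end{equation}
Given $L$ such measures $\mu_1, \dots, \mu_L$ (for instance, one per layer, which is an assumption about how the aggregation is set up) and weights $\lambda_\ell \geq 0$ with $\sum_{\ell=1}^{L} \lambda_\ell = 1$, a Wasserstein barycenter, in the usual case $p = 2$, is any minimizer
\begin{equation}
\bar{\mu} \in \arg\min_{\mu} \sum_{\ell=1}^{L} \lambda_\ell \, W_2^2(\mu, \mu_\ell) .
\end{equation}
Under this reading, a candidate text and a reference text would be compared through the Wasserstein distance between their respective barycenters.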