We introduce \texttt{BaryScore}, a new metric for evaluating text generation based on deep contextualized embeddings (e.g., BERT, RoBERTa, ELMo). The metric is motivated by a new framework relying on optimal transport tools, namely the Wasserstein distance and barycenter. By modelling the layer outputs of deep contextualized embeddings as a probability distribution rather than as a vector embedding, this framework provides a natural way to aggregate the different outputs through the topology of the Wasserstein space. In addition, it provides theoretical grounds for our metric and offers an alternative to available solutions (e.g., MoverScore and BERTScore). Numerical evaluation is performed on four different tasks: machine translation, summarization, data2text generation and image captioning. Our results show that \texttt{BaryScore} outperforms other BERT-based metrics and exhibits more consistent behaviour, in particular for text summarization.
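For readers unfamiliar with the two optimal transport tools named above, the standard textbook definitions for discrete measures can be sketched as follows. The notation ($\mu$, $\nu$, $a$, $b$, $x_i$, $y_j$, $P$, $\lambda_\ell$, $L$) is assumed here for illustration and is not taken from the abstract, which does not state how the metric instantiates the cost, the exponent $p$, or the weights.

```latex
% Assumed notation: two discrete measures supported on embedding vectors
% x_1..x_n and y_1..y_m, with nonnegative weights a and b that each sum to one.
\[
  \mu = \sum_{i=1}^{n} a_i \,\delta_{x_i},
  \qquad
  \nu = \sum_{j=1}^{m} b_j \,\delta_{y_j}
\]
% Wasserstein distance of order p: minimal transport cost over all couplings P
% whose row sums equal a and whose column sums equal b.
\[
  W_p(\mu,\nu)^p
  = \min_{P \in U(a,b)} \sum_{i=1}^{n} \sum_{j=1}^{m} P_{ij}\, \lVert x_i - y_j \rVert^p,
  \qquad
  U(a,b) = \bigl\{ P \in \mathbb{R}_{\ge 0}^{n \times m} :
           P \mathbf{1}_m = a,\; P^{\top} \mathbf{1}_n = b \bigr\}
\]
% Wasserstein barycenter of L measures (for example, one per layer), with
% weights lambda_l >= 0 summing to one: the measure minimizing the weighted
% sum of squared W_2 distances to all of them.
\[
  \bar{\mu} \in \operatorname*{arg\,min}_{\mu}
  \sum_{\ell=1}^{L} \lambda_\ell \, W_2^2(\mu, \mu_\ell)
\]
```

Under this reading, each layer's output would play the role of one $\mu_\ell$, and the barycenter $\bar{\mu}$ would serve as the aggregated representation that is then compared across texts with a Wasserstein distance; this mapping is an interpretation of the abstract, not a statement of the exact construction.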