Modern embedding-based metrics for evaluating generated text generally fall into one of two paradigms: discriminative metrics, trained to directly predict which outputs are of higher quality according to supervised human annotations, and generative metrics, trained to evaluate text based on the probabilities of a generative model. Both have their advantages: discriminative metrics can directly optimize for distinguishing between good and bad outputs, while generative metrics can be trained on abundant raw text. In this paper, we present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data is available. We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as its backbone. We perform an extensive empirical comparison with existing metrics on 5 datasets, 19 languages, and 280 systems, demonstrating the utility of our method. Experimental results show that T5Score achieves the best performance against existing top-scoring metrics on all datasets at the segment level. We release our code and models at https://github.com/qinyiwei/T5Score.
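To make the combination of signals concrete, the following is a minimal sketch, assuming an mT5 backbone loaded through HuggingFace Transformers: the discriminative signal is a margin ranking loss over human preference pairs, and the generative signal is a standard sequence negative log-likelihood on raw text. The `sequence_log_prob` helper, the margin value, and the `lambda_gen` weight are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: combining discriminative and generative training signals on mT5.
# Hyperparameters (margin, lambda_gen) are illustrative, not the paper's values.
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")

def sequence_log_prob(source: str, hypothesis: str) -> torch.Tensor:
    """Length-normalized log-probability of `hypothesis` given `source`."""
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(hypothesis, return_tensors="pt").input_ids
    out = model(**inputs, labels=labels)
    # `out.loss` is the mean token-level NLL; negate it to get a log-probability score.
    return -out.loss

def combined_loss(source, better_hyp, worse_hyp, raw_target,
                  margin=0.1, lambda_gen=1.0):
    """Discriminative margin-ranking loss on a human preference pair,
    plus a generative NLL term on raw (unannotated) text."""
    s_better = sequence_log_prob(source, better_hyp)
    s_worse = sequence_log_prob(source, worse_hyp)
    # Hinge loss: push the better hypothesis's score above the worse one's by `margin`.
    disc = torch.clamp(margin - (s_better - s_worse), min=0.0)
    # Generative signal: maximize likelihood of raw target text.
    gen = -sequence_log_prob(source, raw_target)
    return disc + lambda_gen * gen
```

In this setup, preference-annotated pairs feed the hinge term while unannotated parallel text feeds the likelihood term, so either data source can be used whenever it is available.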