Is it possible to leverage large-scale raw and raw parallel corpora to build a general learned metric? Existing learned metrics have gaps to human judgements, are model-dependent, or are limited to the domains or tasks where human ratings are available. In this paper, we propose SEScore2, a model-based metric pretrained on a million-scale synthetic dataset constructed by our novel retrieval-augmented data synthesis pipeline. SEScore2 achieves high correlation with human judgements without any human rating supervision. Importantly, our unsupervised SEScore2 can outperform supervised metrics, which are trained on News-domain human ratings, in the TED domain. We evaluate SEScore2 on four text generation tasks across three languages. SEScore2 outperforms all prior unsupervised evaluation metrics in machine translation, speech translation, data-to-text, and dialogue generation, with an average Kendall improvement of 0.158. SEScore2 even outperforms the SOTA supervised metric BLEURT on data-to-text, dialogue generation, and overall correlation.
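The abstract reports metric quality as segment-level Kendall correlation against human judgements. As a point of reference only, the sketch below shows one standard way such a correlation could be computed with SciPy; the variable names and toy values are hypothetical, and the paper may use a WMT-style Kendall tau variant rather than the plain statistic shown here.

```python
# Minimal illustration (not the authors' code) of Kendall correlation
# between a learned metric's scores and human ratings on the same segments.
from scipy.stats import kendalltau

metric_scores = [0.71, 0.42, 0.88, 0.55]  # hypothetical per-segment metric outputs
human_ratings = [4.0, 2.5, 4.5, 3.0]      # hypothetical human quality judgements

tau, p_value = kendalltau(metric_scores, human_ratings)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
# Higher tau means the metric ranks outputs more like human judges;
# the 0.158 figure in the abstract is an average improvement in this
# type of correlation over prior unsupervised metrics.
```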