While probabilistic language generators have improved dramatically over the last few years, the automatic evaluation metrics used to assess them have not kept pace with this progress. In the domain of language generation, a good metric must correlate highly with human judgements. Yet, with few exceptions, there is a lack of such metrics in the literature. In this work, we analyse the general paradigm of language generator evaluation. We first discuss the computational and qualitative issues with using automatic evaluation metrics that operate on probability distributions over strings, the backbone of most language generators. We then propose the use of distributions over clusters instead, where we cluster strings based on their text embeddings (obtained from a pretrained language model). While we find the biases introduced by this substitution to be quite strong, we observe that, empirically, this methodology leads to metric estimators with higher correlation with human judgements, while simultaneously reducing estimator variance. We finish the paper with a probing analysis, which leads us to conclude that -- by encoding syntactic- and coherence-level features of text, while ignoring surface-level features -- these clusters may simply be better equipped to evaluate state-of-the-art language models.
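The core proposal, replacing distributions over strings with distributions over embedding-derived clusters, can be illustrated with a short sketch. The snippet below clusters sentence embeddings from a pretrained encoder, estimates an empirical distribution over clusters for human and model text, and compares the two with a forward KL divergence. The encoder name, cluster count, smoothing constant, and choice of divergence are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: estimate distributions over clusters of text embeddings
# and compare model-generated text to human references via a divergence.
# Encoder, cluster count, and divergence are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_distribution(texts, kmeans, encoder):
    """Map strings to clusters and return the smoothed empirical cluster distribution."""
    labels = kmeans.predict(encoder.encode(texts))
    counts = np.bincount(labels, minlength=kmeans.n_clusters).astype(float)
    counts += 1e-9  # smooth to avoid log(0) in the divergence below
    return counts / counts.sum()

# Hypothetical inputs: human-written references and model generations.
human_texts = ["The cat sat on the mat.", "It rained all afternoon."]
model_texts = ["A cat was sitting on a mat.", "Rain fell through the afternoon."]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any pretrained text embedder
kmeans = KMeans(n_clusters=2, random_state=0).fit(
    encoder.encode(human_texts + model_texts)
)

p = cluster_distribution(human_texts, kmeans, encoder)
q = cluster_distribution(model_texts, kmeans, encoder)
kl = float(np.sum(p * np.log(p / q)))  # forward KL between cluster distributions
print(f"KL(human || model) over clusters: {kl:.4f}")
```

In practice, the corpora and cluster count would be far larger; the point is only that metric estimators now operate on a low-dimensional categorical distribution over clusters rather than on probabilities assigned to individual strings.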