A good automatic evaluation metric for language generation ideally correlates highly with human judgements of text quality. Yet, there is a dearth of such metrics, which inhibits the rapid and efficient progress of language generators. One exception is the recently proposed Mauve. In theory, Mauve measures an information-theoretic divergence between two probability distributions over strings: one representing the language generator under evaluation; the other representing the true natural language distribution. Mauve's authors argue that its success comes from the qualitative properties of their proposed divergence. Yet in practice, as this divergence is uncomputable, Mauve approximates it by measuring the divergence between multinomial distributions over clusters instead, where cluster assignments are obtained by grouping strings based on a pre-trained language model's embeddings. As we show, however, this is not a tight approximation, in either theory or practice. This raises the question: why does Mauve work so well? In this work, we show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance. In fact, classical divergences paired with its proposed cluster-based approximation may actually serve as better evaluation metrics. We finish the paper with a probing analysis; this analysis leads us to conclude that, by encoding syntactic- and coherence-level features of text while ignoring surface-level features, such cluster-based substitutes for string distributions may simply be better for evaluating state-of-the-art language generators.
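To make the cluster-based approximation concrete, the following is a minimal sketch and not the authors' implementation. It assumes each string has already been mapped to an embedding vector by a pre-trained language model (toy 2-D vectors stand in for these here), pools human and model embeddings, clusters them with a small hand-written k-means, and compares the two resulting multinomials over clusters with the Jensen-Shannon divergence, used here as one example of a classical divergence. All function names, the smoothing constant, and the choice of `k` are illustrative assumptions.

```python
import math
import random


def assign(point, centroids):
    """Index of the nearest centroid under squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means (illustrative only); returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[assign(p, centroids)].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(xs) / len(bucket) for xs in zip(*bucket)]
    return centroids


def cluster_distribution(points, centroids, smoothing=1e-3):
    """Smoothed multinomial over cluster ids (smoothing value is an assumption)."""
    counts = [smoothing] * len(centroids)
    for p in points:
        counts[assign(p, centroids)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence, a classical symmetric divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cluster_divergence(human_emb, model_emb, k=4, seed=0):
    """Cluster pooled embeddings, then compare the two cluster histograms."""
    centroids = kmeans(human_emb + model_emb, k, seed=seed)
    p = cluster_distribution(human_emb, centroids)
    q = cluster_distribution(model_emb, centroids)
    return js(p, q)


if __name__ == "__main__":
    rng = random.Random(1)
    human = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    model = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(200)]
    print(cluster_divergence(human, model))
```

Because the score depends on the strings only through their cluster assignments, any information that the embeddings and clustering discard cannot affect it; this is the sense in which the cluster-based quantity is a substitute for, rather than an estimate of, a divergence over strings.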