As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. We propose Mauve, a comparison measure for open-ended text generation, which directly compares a generation model's distribution to that of human-written text. Mauve measures the mean area under a divergence curve for the two distributions, exploring the trade-off between two types of errors: those arising from the model assigning high probability to text that is unlikely under the human distribution, and those arising from the model failing to cover parts of the human distribution. Mauve extends a family of information divergence metrics, introducing a tractable approximation based on computing the KL divergence in a quantized embedding space. This yields an efficient implementation that scales up to modern text generation models. Through an extensive empirical study on three open-ended generation tasks, we find that Mauve identifies known properties of generated text, scales naturally with model size, and correlates with human judgments, with fewer restrictions than existing distributional evaluation metrics.
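To make the described computation concrete, the following is a minimal sketch of a Mauve-style score under the assumptions stated in the abstract: embeddings of human and model text are jointly quantized (here with k-means), the resulting histograms P and Q are compared through KL divergences against mixtures R_lambda = lambda*P + (1-lambda)*Q, and the score is the area under the resulting divergence curve. The function names, bin count, and scaling constant below are illustrative choices, not the paper's exact configuration.

```python
# A minimal sketch, assuming `human_emb` and `model_emb` are (n, d) NumPy
# arrays of text embeddings (e.g., from a pretrained language model).
# Hyperparameters (n_bins, scaling) are illustrative, not the paper's defaults.
import numpy as np
from sklearn.cluster import KMeans

def quantized_histograms(human_emb, model_emb, n_bins=50, seed=0):
    """Jointly quantize both embedding sets and return normalized histograms P, Q."""
    all_emb = np.concatenate([human_emb, model_emb], axis=0)
    labels = KMeans(n_clusters=n_bins, random_state=seed, n_init=10).fit_predict(all_emb)
    p = np.bincount(labels[: len(human_emb)], minlength=n_bins).astype(float)
    q = np.bincount(labels[len(human_emb):], minlength=n_bins).astype(float)
    return p / p.sum(), q / q.sum()

def kl(a, b, eps=1e-12):
    """KL divergence KL(a || b) between two discrete histograms (natural log)."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], eps))))

def mauve_score(p, q, scaling=5.0, n_points=101):
    """Area under the divergence curve traced by mixtures R = lam*P + (1-lam)*Q."""
    xs, ys = [], []
    for lam in np.linspace(1e-6, 1 - 1e-6, n_points):
        r = lam * p + (1 - lam) * q
        xs.append(np.exp(-scaling * kl(q, r)))  # penalizes mass Q puts where P is unlikely
        ys.append(np.exp(-scaling * kl(p, r)))  # penalizes mass of P that Q fails to cover
    order = np.argsort(xs)
    return float(np.trapz(np.array(ys)[order], np.array(xs)[order]))
```

A typical usage would be `p, q = quantized_histograms(human_emb, model_emb)` followed by `mauve_score(p, q)`; a score near 1 indicates the two quantized distributions are close, and it decreases as either type of error grows.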