Despite major advances in open-ended text generation, there has been limited progress in designing evaluation metrics for this task. We propose MAUVE -- a metric for open-ended text generation, which directly compares the distribution of machine-generated text to that of human language. MAUVE measures the mean area under the divergence curve for the two distributions, exploring the trade-off between two types of errors: those arising when the model assigns probability to text that is unlikely under the human distribution, and those arising when the model fails to cover parts of the human distribution. We present experiments across two open-ended generation tasks, in the web text domain and the story domain, with a variety of decoding algorithms and model sizes. Our results show that evaluation under MAUVE reflects the expected behavior with respect to model size, with larger models scoring better, in contrast to prior metrics. MAUVE's ordering of the decoding algorithms also agrees with that of generation perplexity, the most widely used metric in open-ended text generation; however, MAUVE presents a more principled evaluation metric for the task as it considers both the model and human text.
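
The "area under the divergence curve" can be made concrete with a small sketch. The code below is a minimal illustration of the idea, not the authors' released implementation: it assumes the human and model text have already been reduced to two histograms `p` and `q` over a shared set of bins (for example, cluster assignments of text embeddings), and the function name, the scaling constant `c`, and the grid of mixture weights are illustrative choices.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same bins."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mauve_from_histograms(p, q, c=5.0, num_mixtures=100):
    """Area under the divergence curve between two discrete distributions.

    p, q : histograms over a shared set of bins (e.g., quantized embeddings
           of human-written vs. machine-generated text).
    c    : scaling constant inside the exponential (a tunable choice here).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Endpoints of the curve: (1, 0) and (0, 1).
    xs, ys = [1.0], [0.0]
    for lam in np.linspace(1e-6, 1 - 1e-6, num_mixtures):
        r = lam * p + (1 - lam) * q                  # mixture R_lambda
        xs.append(np.exp(-c * kl_divergence(q, r)))  # model mass far from human text
        ys.append(np.exp(-c * kl_divergence(p, r)))  # human text the model misses
    xs.append(0.0)
    ys.append(1.0)
    xs, ys = np.array(xs), np.array(ys)
    # Sort points along the x-axis (ties broken by descending y) and
    # integrate the curve with the trapezoidal rule.
    order = np.lexsort((-ys, xs))
    xs, ys = xs[order], ys[order]
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))
```

In this sketch, identical histograms collapse the curve to the point (1, 1) and give a score close to 1, while distributions with little overlap push the curve toward the axes and the score toward 0.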