In this work, we explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics: stress tests with synthetic data. Specifically, we design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. We examine a range of recently proposed evaluation metrics based on pretrained language models, for the tasks of open-ended generation, translation, and summarization. Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics. For example, we find that BERTScore ignores truncation errors in summarization, and MAUVE (built on top of GPT-2) is insensitive to errors at the beginning of generations. Further, we investigate the reasons behind these blind spots and suggest practical workarounds for a more reliable evaluation of text generation.
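To make the stress-test protocol concrete, the sketch below injects a synthetic truncation error into candidate summaries and compares metric scores before and after. It is a minimal illustration under stated assumptions, not the paper's exact setup: it assumes the open-source `bert-score` package, and the corruption function and example texts are hypothetical.

```python
# Minimal stress-test sketch: inject a synthetic truncation error into
# candidate summaries and check whether BERTScore drops accordingly.
# Assumes the open-source `bert-score` package (pip install bert-score);
# the corruption and data below are illustrative, not the paper's setup.
from bert_score import score


def truncate(text: str, keep_ratio: float = 0.5) -> str:
    """Synthetic error: keep only the first `keep_ratio` of the words."""
    words = text.split()
    return " ".join(words[: max(1, int(len(words) * keep_ratio))])


references = [
    "The committee approved the budget after a long debate and "
    "scheduled a public hearing for next month.",
]
candidates = [
    "The committee approved the budget after a lengthy debate and "
    "set a public hearing for next month.",
]
corrupted = [truncate(c) for c in candidates]

# Score the clean and the corrupted candidates against the same references.
_, _, f1_clean = score(candidates, references, lang="en", verbose=False)
_, _, f1_corrupt = score(corrupted, references, lang="en", verbose=False)

print(f"F1 clean    : {f1_clean.mean().item():.4f}")
print(f"F1 truncated: {f1_corrupt.mean().item():.4f}")
# A trustworthy metric should show a clear drop for the truncated output;
# a small or negligible drop signals a blind spot of the kind studied here.
```

The same harness generalizes to other synthetic errors (e.g., shuffling sentences or corrupting the opening tokens) and other metrics: swap in a different corruption function and scoring call, and compare the clean and corrupted scores.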