We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, \method computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of looking for exact matches, we compute token similarity using contextualized BERT embeddings. We evaluate on several machine translation and image captioning benchmarks, and show that BERTScore correlates better with human judgments than existing metrics, often significantly outperforming even task-specific supervised metrics.
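The core matching step can be sketched as a pairwise cosine-similarity computation between contextual token embeddings, followed by greedy matching in both directions. The snippet below is a minimal illustration of this idea, assuming the HuggingFace `transformers` and `torch` packages; the model name, tokenization details, and the omission of any further refinements are simplifications for exposition, not the paper's exact configuration.

```python
# Minimal sketch of BERTScore-style greedy soft matching (illustrative only).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    """Return L2-normalized contextual embeddings for each token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Drop the [CLS]/[SEP] special tokens for simplicity.
        hidden = model(**inputs).last_hidden_state[0, 1:-1]  # (seq_len, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def bertscore_f1(candidate, reference):
    """Match each token to its most similar counterpart and combine into F1."""
    c, r = embed(candidate), embed(reference)
    sim = c @ r.T  # pairwise cosine similarities (rows normalized above)
    precision = sim.max(dim=1).values.mean()  # best match per candidate token
    recall = sim.max(dim=0).values.mean()     # best match per reference token
    return 2 * precision * recall / (precision + recall)

print(bertscore_f1("the cat sat on the mat", "a cat was sitting on the mat"))
```

Because the similarity matrix compares contextual embeddings rather than surface forms, paraphrases such as "sat" and "sitting" still receive high match scores, which exact-match metrics would miss.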