Recent embedding-based evaluation metrics for text generation are assessed primarily by measuring their correlation with human evaluations on standard benchmarks. However, these benchmarks are mostly from domains similar to those used for pretraining word embeddings. This raises concerns about the (lack of) generalization of embedding-based metrics to new and noisy domains that contain a different vocabulary from the pretraining data. In this paper, we examine the robustness of BERTScore, one of the most popular embedding-based metrics for text generation. We show that (a) an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases, (b) taking embeddings from the first layer of pretrained models improves the robustness of all metrics, and (c) the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.
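To make point (b) concrete, the following is a minimal sketch of BERTScore-style greedy matching in which the hidden layer used for token embeddings is an explicit parameter, so first-layer and deeper-layer representations can be compared. The model name `bert-base-uncased`, the helper `bertscore_f1`, and the omission of idf weighting and baseline rescaling are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: BERTScore-style F1 from a chosen hidden layer (assumption: no idf
# weighting or baseline rescaling, unlike the full BERTScore metric).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def embed(sentence, layer):
    """Return L2-normalized token embeddings from the given hidden layer."""
    inputs = tokenizer(sentence, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        # hidden_states is a tuple: index 0 = embedding output, index 1 = first layer, ...
        hidden_states = model(**inputs).hidden_states
    emb = hidden_states[layer].squeeze(0)           # (num_tokens, hidden_dim)
    return torch.nn.functional.normalize(emb, dim=-1)

def bertscore_f1(candidate, reference, layer=1):
    """Greedy cosine matching: precision over candidate tokens, recall over reference tokens."""
    cand, ref = embed(candidate, layer), embed(reference, layer)
    sim = cand @ ref.T                              # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()        # each candidate token matched to its best reference token
    recall = sim.max(dim=0).values.mean()           # each reference token matched to its best candidate token
    return (2 * precision * recall / (precision + recall)).item()

# Compare a first-layer score against a deeper layer on the same pair.
print(bertscore_f1("the cat sat on the mat", "a cat was sitting on the mat", layer=1))
print(bertscore_f1("the cat sat on the mat", "a cat was sitting on the mat", layer=9))
```

The same layer choice can also be made with the public `bert_score` package via its `num_layers` argument; the hand-rolled version above is only meant to expose where the layer enters the computation.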