Recently, there has been a surge of interest in applying pre-trained language models (Pr-LMs) to automatic open-domain dialog evaluation. Pr-LMs offer a promising direction for addressing the multi-domain evaluation challenge. Yet, the impact of different Pr-LMs on the performance of automatic metrics is not well understood. This paper examines 8 different Pr-LMs and studies their impact on three typical automatic dialog evaluation metrics across three different dialog evaluation benchmarks. Specifically, we analyze how the choice of Pr-LM affects the performance of automatic metrics. Extensive correlation analyses are performed on each metric to assess the effects of different Pr-LMs along various axes, including pre-training objectives, dialog evaluation criteria, model size, and cross-dataset robustness. This study serves as the first comprehensive assessment of the effects of different Pr-LMs on automatic dialog evaluation.
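To make the evaluation protocol concrete, the correlation analysis mentioned above typically compares automatic metric scores against human judgments. The following is a minimal sketch, not the paper's released code; the variable names and toy scores are illustrative assumptions, and it simply computes Pearson and Spearman correlations with scipy.

```python
# Minimal sketch of metric-vs-human correlation analysis (illustrative only).
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-response scores from an automatic metric and from human annotators.
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60]
human_ratings = [4.0, 2.5, 4.5, 2.0, 3.5]

pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_rho, spearman_p = spearmanr(metric_scores, human_ratings)

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```

Higher correlation with human ratings indicates that a metric (and its underlying Pr-LM) better reflects human perception of dialog quality.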