Distinct-$n$ score~\cite{Li2016} is a widely used automatic metric for evaluating diversity in language generation tasks. However, we observe that the original approach to calculating distinct scores is evidently biased: it tends to assign higher penalties to longer sequences. We refine the calculation of distinct scores by scaling the number of distinct tokens by its expectation. We provide both empirical and theoretical evidence that our method effectively removes the biases present in the original distinct score. Our experiments show that our proposed metric, \textit{Expectation-Adjusted Distinct (EAD)}, correlates better with human judgment when evaluating response diversity. To foster future research, we provide an example implementation at \url{https://github.com/lsy641/Expectation-Adjusted-Distinct}.
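For concreteness, here is a sketch of the adjustment under a simplifying assumption (notation ours; the linked repository contains the reference implementation): if $C$ $n$-grams are drawn uniformly at random from a vocabulary of size $V$, the expected number of distinct $n$-grams is $V\bigl(1 - \bigl(\tfrac{V-1}{V}\bigr)^{C}\bigr)$, so the expectation-adjusted score takes the form
\[
\mathrm{EAD} = \frac{\mathrm{Count}\bigl(\text{distinct } n\text{-grams}\bigr)}{V\left(1 - \left(\frac{V-1}{V}\right)^{C}\right)},
\]
which, unlike normalizing by the total count $C$ itself, does not systematically shrink as sequences grow longer.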