Automatic evaluation of open-domain dialogs remains an unsolved problem: existing methods do not correlate strongly with human annotations. This paper presents a new automated evaluation method based on follow-ups: we measure the probability that a language model will continue the conversation with one of a fixed set of follow-ups (e.g., "not really relevant here", "what are you trying to say"). Compared against twelve existing methods, our new evaluation achieves the highest correlation with human evaluations.
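As a rough illustration of the idea, the sketch below scores a dialog by the likelihood a causal language model assigns to fixed negative follow-ups given the dialog as context. The model choice (DialoGPT), the two follow-ups, and the averaging are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of follow-up likelihood scoring (assumptions: DialoGPT as the
# language model, a two-item follow-up list, mean log-probability aggregation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "microsoft/DialoGPT-medium"  # assumed model choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Illustrative fixed follow-ups; the paper uses its own curated list.
FOLLOW_UPS = [
    "not really relevant here",
    "what are you trying to say",
]

def follow_up_log_likelihood(context: str, follow_up: str) -> float:
    """Average log-probability of the follow-up tokens given the dialog context."""
    ctx_ids = tokenizer.encode(context + tokenizer.eos_token, return_tensors="pt")
    fu_ids = tokenizer.encode(follow_up + tokenizer.eos_token, return_tensors="pt")
    input_ids = torch.cat([ctx_ids, fu_ids], dim=-1)
    # Mask context positions so the loss is computed only over follow-up tokens.
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[-1]] = -100
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over follow-up tokens
    return -loss.item()

def score_dialog(context: str) -> float:
    """Higher likelihood of negative follow-ups suggests lower dialog quality."""
    return sum(follow_up_log_likelihood(context, fu) for fu in FOLLOW_UPS) / len(FOLLOW_UPS)

print(score_dialog("Hello, how are you today? I am fine, thanks for asking."))
```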