Automatic evaluation is beneficial for open-domain dialog system development. However, standard word-overlap metrics (BLEU, ROUGE) do not correlate well with human judgments of open-domain dialog systems. In this work we propose to use the sentiment of the next user utterance for turn- or dialog-level evaluation. Specifically, we propose three methods: one that predicts the next sentiment directly, and two others that predict the next user utterance using an utterance or a feedback generator model and then classify its sentiment. Experiments show that our methods outperform existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
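The "predict-then-classify" variant described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tiny word-lexicon classifier below is a stand-in for a trained sentiment model, and the predicted next user utterance would in practice come from an utterance or feedback generator model.

```python
# Sketch of turn-level evaluation via the sentiment of the (predicted)
# next user utterance. The lexicon classifier is a toy stand-in for a
# trained sentiment model; it is NOT the method from the paper itself.

POSITIVE = {"great", "thanks", "helpful", "good", "love"}
NEGATIVE = {"wrong", "useless", "bad", "confusing", "stop"}

def sentiment_score(utterance: str) -> float:
    """Score in [-1, 1]: (positive - negative word count) / total words."""
    words = [w.strip(".,!?").lower() for w in utterance.split()]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

def evaluate_turn(predicted_next_user_utterance: str) -> float:
    """Turn-level quality score: sentiment of the predicted next user turn."""
    return sentiment_score(predicted_next_user_utterance)

print(evaluate_turn("Thanks, that was really helpful!"))  # positive score
print(evaluate_turn("That is wrong and useless."))        # negative score
```

The intuition is that a satisfied user tends to respond positively to a good system turn, so the sentiment of the following user utterance serves as a proxy reward for the preceding system response.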