Multiple different responses are often plausible for a given open-domain dialog context. Prior work has shown the importance of having multiple valid reference responses for meaningful and robust automated evaluation. In such cases, common practice has been to collect more human-written references. However, such collection can be expensive, time-consuming, and difficult to scale. Instead, we propose a novel technique for automatically expanding a human-generated reference into a set of candidate references. We fetch plausible references from knowledge sources and adapt them so that they are more fluent in the context of the dialog instance in question. More specifically, we use (1) a commonsense knowledge base to elicit a large number of plausible reactions given the dialog history, and (2) relevant instances retrieved from a dialog corpus using similar past as well as future contexts. We demonstrate that our automatically expanded reference sets lead to large improvements in the correlation of automated metrics with human ratings of system outputs on the DailyDialog dataset.
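The corpus-retrieval idea in point (2) can be illustrated with a minimal sketch. This is not the paper's actual method: it uses a simple bag-of-words cosine similarity as a stand-in for whatever retrieval model the authors employ, and the `past`/`future` field names and the equal weighting of past and future contexts are illustrative assumptions.

```python
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_references(query_context: str, corpus: list, top_k: int = 2) -> list:
    """Rank corpus entries by similarity of their surrounding context to the
    query context, and return their responses as candidate references.
    Each corpus entry is a dict with 'past', 'response', and optionally
    'future' (the turn following the response)."""
    scored = []
    for entry in corpus:
        # Combine similarity to the past context with, when available,
        # similarity to the future context (equal weights, an assumption).
        score = bow_cosine(query_context, entry["past"])
        if entry.get("future"):
            score = 0.5 * score + 0.5 * bow_cosine(query_context, entry["future"])
        scored.append((score, entry["response"]))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [resp for _, resp in scored[:top_k]]
```

For example, given a corpus entry whose past context is "do you want to grab lunch today", the query "want to get lunch with me today" would retrieve that entry's response as a candidate reference; the adaptation step described above would then rewrite it to fit the new dialog more fluently.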