Despite recent progress in open-domain dialogue evaluation, how to develop automatic metrics remains an open problem. We explore the potential of dialogue evaluation featuring dialog act information, which has rarely been explicitly modeled in previous methods. However, dialog acts are generally defined at the utterance level and are therefore of coarse granularity, since an utterance can contain multiple segments serving different functions. Hence, we propose segment act, an extension of dialog act from the utterance level to the segment level, and crowdsource a large-scale dataset for it. To utilize segment act flows, i.e., sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval. This framework provides a reference-free approach for dialogue evaluation by finding pseudo-references. Extensive experiments against strong baselines on three benchmark datasets demonstrate the effectiveness and other desirable characteristics of FlowEval, pointing out a potential path for better dialogue evaluation.
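To make the core ideas concrete, the sketch below illustrates, in a toy way, what a segment act flow looks like (a sequence of segment-level act labels) and how a consensus-based, reference-free metric might retrieve a pseudo-reference by flow similarity. The act labels, the similarity measure, and the retrieval step are illustrative assumptions, not FlowEval's actual implementation.

```python
# Toy sketch (assumptions, not the paper's method): represent each dialogue by its
# segment act flow and retrieve the most similar dialogue as a pseudo-reference.
from difflib import SequenceMatcher

# A segment act flow: the sequence of segment-level act labels in one dialogue.
# The label set here is hypothetical.
candidate_flow = ["question", "inform", "acknowledge"]

# A small pool of dialogues, each reduced to its segment act flow.
corpus_flows = {
    "dialogue_a": ["question", "inform", "thank"],
    "dialogue_b": ["greeting", "question", "inform", "acknowledge"],
}

def flow_similarity(flow_a, flow_b):
    """Similarity between two act-label sequences (toy choice: difflib's
    longest-matching-subsequence ratio over the label sequences)."""
    return SequenceMatcher(None, flow_a, flow_b).ratio()

# Consensus-style retrieval: pick the corpus dialogue whose flow best matches
# the candidate's flow and treat it as a pseudo-reference for scoring.
pseudo_reference = max(
    corpus_flows, key=lambda name: flow_similarity(candidate_flow, corpus_flows[name])
)
print(pseudo_reference, flow_similarity(candidate_flow, corpus_flows[pseudo_reference]))
```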