There has been great recent advancement in human-computer chat. However, proper evaluation currently requires human judgments, which produce notoriously high-variance metrics due to their inherent subjectivity. Furthermore, there is little standardization in the methods and labels used for evaluation, and an overall lack of work comparing and assessing the validity of different evaluation approaches. As a consequence, existing evaluation results likely paint an incomplete picture of the strengths and weaknesses of open-domain chatbots. We aim toward a dimensional evaluation of human-computer chat that can reliably measure several distinct aspects of chat quality. To this end, we present a novel human evaluation method that quantifies the rate of several quality-related chatbot behaviors. Our results demonstrate that our method is more suitable for dimensional chat evaluation than alternative Likert-style or comparative methods. We then use our validated method, alongside existing methods, to evaluate four open-domain chat models from the recent literature.