At the heart of improving conversational AI lies the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv:1603.08023), and human evaluations remain the gold standard. Unfortunately, how to perform human evaluations is also an open problem: different data collection methods yield varying levels of human agreement and statistical sensitivity, resulting in differing requirements for human annotation hours and labor costs. In this work we compare five crowdworker-based human evaluation methods and find that the best method depends on the types of models being compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice on when to use each method, as well as possible future directions.