While automatic performance metrics are crucial for machine learning of artificial human-like behaviour, the gold standard for evaluation remains human judgement. Subjective evaluation of artificial human-like behaviour in embodied conversational agents is, however, expensive, and little is known about the quality of the data it returns. Broadly, two approaches to subjective evaluation can be distinguished: one relies on ratings, the other on pairwise comparisons. In this study we use co-speech gestures to compare the two approaches and to assess their appropriateness for evaluating artificial behaviour. We consider not only their ability to assess quality, but also the effort they demand of participants and the time required to collect subjective data. Using crowdsourcing, we rated the quality of co-speech gestures in avatars to determine which method picks up more detail in subjective assessments. We compared gestures generated by three different machine learning models with varying levels of behavioural quality. We found that both approaches were able to rank the videos according to quality and that the resulting rankings were significantly correlated, showing that, in terms of quality assessment, neither method is preferable to the other. We also found that pairwise comparisons were slightly faster and yielded higher inter-rater reliability, suggesting that for small-scale studies pairwise comparisons are to be favoured over ratings.
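To make the rank-correlation claim concrete, the following is a minimal sketch of how one could check whether ratings and pairwise comparisons induce consistent orderings over the same stimuli. The data, variable names, and the use of Spearman's rho are illustrative assumptions, not taken from the study itself.

```python
# Hypothetical sketch: do ratings and pairwise comparisons rank stimuli consistently?
import numpy as np
from scipy.stats import spearmanr

# Assumed toy data: mean rating per stimulus (higher = better perceived quality)
rating_scores = np.array([3.1, 4.2, 2.5, 3.8, 4.0, 2.9])

# Assumed toy data: pairwise win rate per stimulus (fraction of comparisons won)
pairwise_win_rates = np.array([0.45, 0.81, 0.20, 0.66, 0.74, 0.33])

# Spearman's rho tests whether the two methods order the stimuli the same way,
# mirroring the ranking-correlation analysis described in the abstract.
rho, p_value = spearmanr(rating_scores, pairwise_win_rates)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```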