Creating agents that can interact naturally with humans is a common goal in artificial intelligence (AI) research. However, evaluating these interactions is challenging: collecting online human-agent interactions is slow and expensive, yet faster proxy metrics often do not correlate well with interactive evaluation. In this paper, we assess the merits of these existing evaluation metrics and present a novel approach to evaluation called the Standardised Test Suite (STS). The STS uses behavioural scenarios mined from real human interaction data. Agents see replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human annotators to mark as success or failure, and agents are ranked according to the proportion of continuations in which they succeed. The resulting STS is fast, controlled, interpretable, and representative of naturalistic interactions. Altogether, the STS consolidates much of what is desirable across many of our standard evaluation metrics, allowing us to accelerate research progress towards producing agents that can interact naturally with humans. https://youtu.be/YR1TngGORGQ
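To make the ranking procedure described above concrete, the following is a minimal sketch (not the paper's implementation) of how agents could be scored and ordered once human annotators have marked each recorded continuation as a success or failure. The data layout and agent names here are hypothetical illustrations.

```python
from collections import defaultdict

def rank_agents(annotations):
    """Rank agents by the fraction of scenario continuations judged successful.

    `annotations` is a list of (agent_name, success_flag) pairs, one per
    annotated continuation; success_flag is True when annotators marked
    the continuation a success.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for agent, success in annotations:
        totals[agent] += 1
        if success:
            successes[agent] += 1
    scores = {agent: successes[agent] / totals[agent] for agent in totals}
    # Higher success proportion ranks first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical example: three agents evaluated on annotated continuations.
example = [("agent_a", True), ("agent_a", False),
           ("agent_b", True), ("agent_b", True),
           ("baseline", False), ("baseline", False)]
print(rank_agents(example))
# [('agent_b', 1.0), ('agent_a', 0.5), ('baseline', 0.0)]
```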