Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive: a system produces output without human intervention. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks, ranging from goal-oriented to open-ended, to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21's J1-Jumbo), we find that better non-interactive performance does not always translate into better human-LM interaction, and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.