Despite widespread use of LLMs as conversational agents, evaluations of their performance fail to capture a crucial aspect of communication: interpreting language in context. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs can make this type of inference, known as an implicature, we design a simple task and evaluate widely used state-of-the-art models. We find that, despite evaluating only utterances that require a binary inference (yes or no), most models perform close to random. Models adapted to be "aligned with human intent" perform much better, but still show a significant gap to human performance. We present our findings as a starting point for further research into evaluating how LLMs interpret language in context, and to drive the development of more pragmatic and useful models of human discourse.