Social intelligence and Theory of Mind (ToM), i.e., the ability to reason about the different mental states, intents, and reactions of all people involved, allow humans to effectively navigate and understand everyday social interactions. As NLP systems are used in increasingly complex social situations, their ability to grasp social dynamics becomes crucial. In this work, we examine the open question of social intelligence and Theory of Mind in modern NLP systems from both an empirical and a theory-based perspective. We show that one of today's largest language models (GPT-3; Brown et al., 2020) lacks this kind of social intelligence out of the box, using two tasks: SocialIQa (Sap et al., 2019), which measures models' ability to understand intents and reactions of participants in social interactions, and ToMi (Le et al., 2019), which measures whether models can infer mental states and realities of participants in situations. Our results show that models struggle substantially at these Theory of Mind tasks, with well-below-human accuracies of 55% and 60% on SocialIQa and ToMi, respectively. To conclude, we draw on theories from pragmatics to contextualize this shortcoming of large language models, examining the limitations stemming from their data, neural architecture, and training paradigms. Challenging the prevalent narrative that only scale is needed, we posit that person-centric NLP approaches might be more effective towards neural Theory of Mind. In our updated version, we also analyze newer instruction-tuned and RLHF models for neural ToM. We find that even ChatGPT and GPT-4 do not display emergent Theory of Mind; strikingly, even GPT-4 achieves only 60% accuracy on the ToMi questions related to mental states and realities.