Accumulating Context Changes the Beliefs of Language Models

Language model (LM) assistants are increasingly used in applications such as brainstorming and research. Improvements in memory and context size have allowed these models to become more autonomous, which has also resulted in more text accumulation in their context windows without explicit user intervention. This comes with a latent risk: the belief profiles of models -- their understanding of the world as manifested in their responses or actions -- may silently change as context accumulates. This can lead to subtly inconsistent user experiences, or shifts in behavior that deviate from the original alignment of the models. In this paper, we explore how accumulating context by engaging in interactions and processing text -- talking and reading -- can change the beliefs of language models, as manifested in their responses and behaviors. Our results reveal that models' belief profiles are highly malleable: GPT-5 exhibits a 54.7% shift in its stated beliefs after 10 rounds of discussion about moral dilemmas and queries about safety, while Grok 4 shows a 27.2% shift on political issues after reading texts from the opposing position. We also examine models' behavioral changes by designing tasks that require tool use, where each tool selection corresponds to an implicit belief. We find that these changes align with stated belief shifts, suggesting that belief shifts will be reflected in actual behavior in agentic systems. Our analysis exposes the hidden risk of belief shift as models undergo extended sessions of talking or reading, rendering their opinions and actions unreliable.
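The abstract reports belief shifts as percentages (for example 54.7% and 27.2%) but does not say how they are computed. As a minimal sketch, assuming a shift is counted whenever a model's stated answer to the same question differs before and after the accumulated context, the rate could be computed as below. The function name `belief_shift_rate`, the question IDs, and the answer labels are hypothetical illustrations, not the authors' metric or data.

```python
from typing import Mapping


def belief_shift_rate(before: Mapping[str, str], after: Mapping[str, str]) -> float:
    """Percentage of shared questions whose stated answer changed.

    ASSUMPTION: a "shift" means the answer label differs between the two
    measurements. The paper's actual definition may differ.
    """
    shared = before.keys() & after.keys()
    if not shared:
        raise ValueError("no shared questions to compare")
    changed = sum(1 for q in shared if before[q] != after[q])
    return 100.0 * changed / len(shared)


# Hypothetical answers recorded before and after a long session.
before = {"q1": "agree", "q2": "disagree", "q3": "agree", "q4": "neutral"}
after = {"q1": "agree", "q2": "agree", "q3": "disagree", "q4": "neutral"}

rate = belief_shift_rate(before, after)
print(f"{rate:.1f}% of stated beliefs changed")
```

Comparing only questions present in both measurements keeps the rate from being distorted by items the model declined to answer in one of the two rounds.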

