As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction, Acting through Belief Bottlenecks Expressed in Language (ABBEL), along with methods to further improve ABBEL agents via reinforcement learning (RL) post-training. ABBEL replaces the long multi-step interaction history with a belief state: a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates its prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL produces interpretable beliefs while keeping memory use near-constant over interaction steps. However, bottleneck approaches are prone to error propagation: we observe that errors in belief updating cause inferior performance compared to the full-context setting. We therefore train LLMs to generate and act on beliefs within the ABBEL framework via RL. We experiment with belief grading, which rewards higher-quality beliefs, and with belief length penalties, which reward more compressed beliefs. Our experiments demonstrate that RL can improve ABBEL's performance beyond the full-context setting while using less memory than contemporaneous approaches.
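To make the two-stage loop concrete, the following is a minimal sketch of the ABBEL interaction cycle and a shaped RL reward of the kind described above; it is not the paper's implementation. All names (`llm`, `env`, `grade_fn`), the prompt wording, the environment interface, and the reward coefficients are illustrative assumptions.

```python
# Minimal ABBEL sketch (illustrative, not the paper's implementation).
# Assumptions: `llm` is any prompt -> text callable; `env` is a generic
# multi-step environment whose step() returns (observation, reward, done).

def abbel_episode(llm, env, max_steps=50):
    obs = env.reset()
    belief = "No information gathered yet."  # initial (empty) belief state
    for _ in range(max_steps):
        # Stage 1: belief update -- prior belief + newest observation
        # are condensed into a posterior belief in natural language.
        belief = llm(
            f"Prior belief:\n{belief}\n\n"
            f"New observation:\n{obs}\n\n"
            "Update the belief: summarize everything known so far about the "
            "task-relevant unknowns. Respond with the new belief only."
        )
        # Stage 2: action selection conditions on the posterior belief ONLY,
        # never on the full interaction history (the belief bottleneck).
        action = llm(
            f"Current belief:\n{belief}\n\n"
            "Choose the next action. Respond with the action only."
        )
        obs, reward, done = env.step(action)
        if done:
            return reward, belief
    return 0.0, belief


def shaped_reward(task_reward, belief, grade_fn, alpha=0.1, beta=0.001):
    """Hypothetical shaped reward for RL post-training: belief grading
    (grade_fn scores belief quality) rewards higher-quality beliefs, and a
    length penalty rewards more compressed ones. Coefficients are made up."""
    return task_reward + alpha * grade_fn(belief) - beta * len(belief.split())
```

Because each prompt contains only the current belief and a single observation, the context handed to the model stays near-constant in size regardless of how many steps the episode runs.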