In this paper, we propose a test-time adaptive LLM agent for partially observable environments that performs exploratory inference through posterior-guided belief refinement, without relying on gradient-based updates or additional training. Our agent maintains an external structured belief over the environment state, iteratively updates it via action-conditioned observations, and selects actions by maximizing predicted information gain over the belief space. We estimate information gain using a lightweight LLM-based surrogate and assess world alignment through a novel reward that quantifies the consistency between the posterior belief and the ground-truth environment configuration. Experiments show that our method outperforms inference-time scaling baselines, such as prompt-augmented and retrieval-enhanced LLMs, in aligning with latent world states, while incurring significantly lower integration overhead.
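To make the belief-refinement loop concrete, the following minimal Python sketch is our own illustration, not the paper's implementation: a structured belief over candidate latent states is updated by Bayes' rule on action-conditioned observations, and the next action is chosen by maximizing predicted information gain (expected entropy reduction). The names `Belief`, `bayes_update`, `expected_info_gain`, and `select_action` are hypothetical, and the explicit observation model stands in for the lightweight LLM-based surrogate described above.

```python
import math
from typing import Callable, Dict, List

# Hypothetical sketch of the loop described in the abstract.
# `Belief` maps candidate latent world states to posterior probabilities.
Belief = Dict[str, float]


def entropy(belief: Belief) -> float:
    """Shannon entropy of the current belief over latent states."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)


def bayes_update(belief: Belief, likelihood: Callable[[str], float]) -> Belief:
    """Posterior update given an action-conditioned observation likelihood."""
    unnorm = {s: p * likelihood(s) for s, p in belief.items()}
    z = sum(unnorm.values()) or 1e-12
    return {s: p / z for s, p in unnorm.items()}


def expected_info_gain(belief: Belief,
                       obs_model: Callable[[str, str], Dict[str, float]],
                       action: str) -> float:
    """Predicted entropy reduction for an action, marginalized over observations.

    In the paper this quantity is estimated by a lightweight LLM surrogate;
    here an explicit observation model P(o | s, a) is used as a stand-in.
    """
    prior_h = entropy(belief)
    # P(o | a) = sum_s P(o | s, a) P(s)
    obs_probs: Dict[str, float] = {}
    for s, p in belief.items():
        for o, po in obs_model(s, action).items():
            obs_probs[o] = obs_probs.get(o, 0.0) + p * po
    gain = prior_h
    for o, po in obs_probs.items():
        posterior = bayes_update(belief, lambda s: obs_model(s, action).get(o, 0.0))
        gain -= po * entropy(posterior)
    return gain


def select_action(belief: Belief, actions: List[str],
                  obs_model: Callable[[str, str], Dict[str, float]]) -> str:
    """Pick the action that maximizes predicted information gain."""
    return max(actions, key=lambda a: expected_info_gain(belief, obs_model, a))


if __name__ == "__main__":
    # Toy example: two candidate latent states and one probing action.
    belief = {"door_left": 0.5, "door_right": 0.5}

    def obs_model(state: str, action: str) -> Dict[str, float]:
        # A noisy sensor that reports the true state 80% of the time.
        correct = "left" if state == "door_left" else "right"
        wrong = "right" if correct == "left" else "left"
        return {correct: 0.8, wrong: 0.2}

    print(select_action(belief, ["peek"], obs_model))
```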