HugAgent: Benchmarking LLMs for Simulation of Individualized Human Reasoning

Chance Jiajie Li, Zhenze Mo, Yuhan Tang, Ao Qu, Jiayi Wu, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Hang Jiang, Paul Pu Liang, Jinhua Zhao, Luis Alberto Alonso Pastor, Kent Larson

From arXiv. To appear in the NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models (LAW).

Simulating human reasoning in open-ended tasks has long been a central aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), which rethinks human reasoning simulation along three dimensions: (i) from averaged to individualized reasoning, (ii) from behavioral mimicry to cognitive alignment, and (iii) from vignette-based to open-ended data. The benchmark evaluates whether a model can predict a specific person's behavioral responses and the underlying reasoning dynamics in out-of-distribution scenarios, given partial evidence of their prior views. HugAgent adopts a dual-track design: a human track that automates and scales the think-aloud method to collect ecologically valid human reasoning data, and a synthetic track for further scalability and systematic stress testing. This architecture enables low-cost, extensible expansion to new tasks and populations. Experiments with state-of-the-art language models reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. The benchmark, along with its complete data collection pipeline and companion chatbot, is open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).
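To make the contrast between population-level and individualized prediction concrete, here is a minimal toy sketch. It is not the benchmark's actual protocol, data format, or metric: the response table, the held-out question, and both predictor functions are hypothetical stand-ins chosen for illustration. Each simulated person's answer to one held-out question is predicted in two ways: from the majority answer of everyone else, and from that person's own previously observed answers (the "partial evidence of their prior views").

```python
from collections import Counter

# Hypothetical toy data (not from HugAgent): person -> {question: answer}.
responses = {
    "p1": {"q1": "agree", "q2": "agree", "q3": "agree"},
    "p2": {"q1": "agree", "q2": "agree", "q3": "agree"},
    "p3": {"q1": "agree", "q2": "agree", "q3": "agree"},
    "p4": {"q1": "agree", "q2": "agree", "q3": "agree"},
    "p5": {"q1": "disagree", "q2": "disagree", "q3": "disagree"},
    "p6": {"q1": "disagree", "q2": "disagree", "q3": "disagree"},
}
HELD_OUT = "q3"  # the question each predictor must answer without seeing it


def population_baseline(person: str) -> str:
    """Predict the majority held-out answer among all *other* people."""
    votes = Counter(
        answers[HELD_OUT] for p, answers in responses.items() if p != person
    )
    return votes.most_common(1)[0][0]


def individual_predictor(person: str) -> str:
    """Predict from this person's own observed answers (a crude stand-in
    for a model conditioned on an individual's prior views)."""
    observed = [a for q, a in responses[person].items() if q != HELD_OUT]
    return Counter(observed).most_common(1)[0][0]


def accuracy(predict) -> float:
    """Fraction of people whose held-out answer is predicted correctly."""
    hits = sum(predict(p) == responses[p][HELD_OUT] for p in responses)
    return hits / len(responses)


if __name__ == "__main__":
    print("population baseline:", accuracy(population_baseline))
    print("individualized:     ", accuracy(individual_predictor))
```

In this toy setup the population baseline misses exactly the people who hold the minority view, which is the kind of erased individuality the abstract describes. A real evaluation would replace both predictors with language-model calls and would also need to score reasoning traces, not only final answers.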
