With the rapid development of LLM-based agents, there is a growing trend to incorporate agent-specific data into the pre-training stage of LLMs, aiming to better align LLMs with real-world autonomous task execution. However, current pre-training benchmarks primarily focus on isolated and static skills, e.g., common knowledge or mathematical/code reasoning, and fail to reflect a model's agentic capabilities. On the other hand, agent benchmarks are typically designed for post-trained models, requiring multi-turn task-execution abilities that base models struggle to support. Thus, there is a compelling need for a benchmark that can evaluate agentic potential during pre-training and guide model training more effectively. To address this gap, we propose APTBench, a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text-completion questions tailored for base models. It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios, including software engineering and deep research. Compared to existing general-purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent, while remaining significantly more lightweight and cost-effective than full-scale, end-to-end agent evaluations after post-training.
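To make the conversion concrete, below is a minimal sketch of the kind of trajectory-to-question transformation the abstract describes: one step of a successful agent trajectory becomes a multiple-choice "next action" question answerable by a base model. This is an illustrative assumption, not the paper's implementation; all names (`Step`, `Trajectory`, `make_mcq`) and the distractor-sampling choice are hypothetical.

```python
# Hypothetical sketch: convert one step of a successful agent trajectory
# into a 4-way multiple-choice "next action" question for a base model.
# Not the APTBench implementation; names and design are assumptions.
import random
from dataclasses import dataclass

@dataclass
class Step:
    observation: str   # environment feedback seen at this step
    action: str        # the action the successful agent actually took

@dataclass
class Trajectory:
    task: str          # natural-language task description
    steps: list[Step]  # ordered (observation, action) pairs

def make_mcq(traj: Trajectory, idx: int, distractors: list[str], seed: int = 0):
    """Build a multiple-choice question whose gold answer is the agent's
    next action at position `idx`; distractors are plausible actions,
    e.g., sampled from other trajectories."""
    history = "\n".join(
        f"Observation: {s.observation}\nAction: {s.action}"
        for s in traj.steps[:idx]
    )
    gold = traj.steps[idx].action
    options = distractors[:3] + [gold]
    random.Random(seed).shuffle(options)
    answer = "ABCD"[options.index(gold)]
    prompt = (
        f"Task: {traj.task}\n{history}\n"
        f"Observation: {traj.steps[idx].observation}\n"
        "Which action should the agent take next?\n"
        + "\n".join(f"{label}. {opt}" for label, opt in zip("ABCD", options))
    )
    return prompt, answer
```

A text-completion variant would follow the same pattern but drop the options and score the base model's likelihood of generating the gold action given the trajectory prefix.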