Large Language Models (LLMs) are increasingly deployed as autonomous agents, and their use of external tools via the Model Context Protocol (MCP) is widely regarded as a key direction of future development. Existing MCP evaluation sets suffer from issues such as reliance on external MCP services and a lack of difficulty awareness. To address these limitations, we propose MCPAgentBench, a benchmark built on real-world MCP definitions and designed to evaluate the tool-use capabilities of agents. We construct a dataset containing authentic tasks and simulated MCP tools. Evaluation is conducted in a dynamic sandbox environment that presents agents with candidate tool lists containing distractors, thereby testing their tool selection and discrimination abilities. Furthermore, we introduce comprehensive metrics that measure both task completion rates and execution efficiency. Experiments on a range of recent mainstream LLMs reveal significant performance differences in handling complex, multi-step tool invocations. All code is open-sourced on GitHub.
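To make the described evaluation protocol concrete, the sketch below illustrates one possible shape of such a harness: the tools a task actually requires are mixed with sampled distractors before being handed to the agent, and task completion rate and call efficiency are aggregated afterwards. This is a minimal illustration under assumed interfaces; the names (`Task`, `build_candidate_tools`, `agent.run`) and the efficiency definition are hypothetical and not the benchmark's actual implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    """One benchmark task: a user query plus the tools actually needed to solve it."""
    query: str
    required_tools: list[str]   # names of the MCP tools needed for this task
    minimal_calls: int          # smallest number of tool calls that solves the task


@dataclass
class EpisodeResult:
    """Outcome of running an agent on one task inside the sandbox."""
    solved: bool
    tool_calls: int


def build_candidate_tools(task: Task, tool_pool: list[str], num_distractors: int = 5) -> list[str]:
    """Mix the task's required tools with distractor tools sampled from the wider pool."""
    distractors = [t for t in tool_pool if t not in task.required_tools]
    candidates = task.required_tools + random.sample(
        distractors, k=min(num_distractors, len(distractors))
    )
    random.shuffle(candidates)
    return candidates


def evaluate(agent, tasks: list[Task], tool_pool: list[str]) -> dict[str, float]:
    """Run the agent on every task and report completion rate and call efficiency."""
    results: list[tuple[Task, EpisodeResult]] = []
    for task in tasks:
        candidates = build_candidate_tools(task, tool_pool)
        # `agent.run` is an assumed interface returning an EpisodeResult-like object.
        results.append((task, agent.run(task.query, candidates)))

    completion_rate = sum(r.solved for _, r in results) / len(results)

    # Efficiency here is the ratio of the minimal number of calls to the calls
    # actually made, averaged over solved tasks (1.0 means no wasted invocations).
    solved = [(t, r) for t, r in results if r.solved]
    efficiency = (
        sum(t.minimal_calls / max(r.tool_calls, 1) for t, r in solved) / len(solved)
        if solved else 0.0
    )
    return {"completion_rate": completion_rate, "efficiency": efficiency}
```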