As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, and action-level inspection lets stakeholders audit agent behavior. The framework comprises 250 diverse prompts spanning 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism based on multi-turn feedback that keeps humans in control of agent refinement. Evaluating four state-of-the-art LLMs (Claude-4 Sonnet, GPT-4.1, GPT-4.1-mini, Gemini-2.5 Flash) across 1,000 test runs, we characterize both capabilities and limitations: models achieved perfect scores on 92.8% of tasks, and structured external feedback drove significant improvements (average +3.2%, up to +32.8% for complex scenes), but systematic error patterns emerged in tool state management and long-horizon planning. Notably, specification clarity proved more important than task complexity: models achieved perfect scores on all tasks when given explicit, verifiable criteria. These findings demonstrate that transparent evaluation frameworks can establish trust in agentic systems, with external oversight proving more reliable than self-correction for guiding agent behavior. Our open-source framework provides a template for trustworthy agent assessment. Code and data: https://github.com/hyunjun1121/DrawingBench
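
To make the evaluation design concrete, the sketch below illustrates how rule-based, deterministic scoring and a multi-turn external feedback loop could be wired together. It is a minimal illustration, not the DrawingBench implementation: the criterion names, the action schema, the canvas dimensions, and the `generate` callback are all hypothetical placeholders for the framework's actual criteria and agent interface.

```python
# Minimal sketch (assumptions noted above) of deterministic, rule-based scoring
# with an external multi-turn feedback loop. Not the DrawingBench codebase.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    tool: str  # hypothetical tool name, e.g. "pen" or "eraser"
    x: int
    y: int


# Each criterion is a deterministic predicate over the full action sequence,
# so scores are reproducible and failures can be traced to specific actions.
Criterion = Callable[[list[Action]], bool]

CRITERIA: dict[str, Criterion] = {
    "canvas_bounds": lambda acts: all(0 <= a.x < 800 and 0 <= a.y < 600 for a in acts),
    "uses_pen": lambda acts: any(a.tool == "pen" for a in acts),
    "nonempty": lambda acts: len(acts) > 0,
}


def evaluate(actions: list[Action]) -> dict[str, bool]:
    """Run every rule; the per-criterion results double as structured feedback."""
    return {name: rule(actions) for name, rule in CRITERIA.items()}


def feedback_loop(generate: Callable[[list[str]], list[Action]], max_turns: int = 3) -> bool:
    """External oversight: re-prompt the agent with the names of failed criteria."""
    failed: list[str] = []
    for _ in range(max_turns):
        actions = generate(failed)  # agent proposes a new action sequence
        results = evaluate(actions)
        failed = [name for name, ok in results.items() if not ok]
        if not failed:
            return True  # all criteria satisfied; stop refining
    return False
```

Because every criterion is a pure function of the action sequence, scoring is reproducible across runs, and the list of failed criteria returned to the agent is exactly the kind of structured, verifiable feedback the abstract describes as driving improvement.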