Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed, long-latency tasks on mobile devices. However, existing evaluation benchmarks remain constrained to a limited set of applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and an ATP of 50.47%. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.
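As a rough illustration of the milestone-based measurement mentioned above, the sketch below assumes one plausible reading of ATP: the fraction of a task's milestones the agent reaches, averaged over all tasks. The exact formulation, weighting, and function names here are assumptions, not the benchmark's definitive implementation.

```python
from typing import Sequence

def task_progress(milestones_total: int, milestones_reached: int) -> float:
    """Assumed per-task progress: fraction of milestones the agent reached."""
    if milestones_total == 0:
        return 0.0
    return milestones_reached / milestones_total

def average_task_progress(progress_per_task: Sequence[float]) -> float:
    """Assumed ATP: unweighted mean of per-task progress over the benchmark."""
    if not progress_per_task:
        return 0.0
    return sum(progress_per_task) / len(progress_per_task)

# Hypothetical example: three tasks with 5, 8, and 10 milestones each.
progress = [task_progress(5, 2), task_progress(8, 8), task_progress(10, 3)]
print(f"ATP = {average_task_progress(progress):.2%}")  # ATP = 56.67%
```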