Mobile GUI agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they rely on either single-path offline benchmarks or live online benchmarks. Offline benchmarks built on static, single-path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility because live evaluation is dynamic and unpredictable. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular, multi-path-aware offline benchmarking framework for mobile GUI agents, which enables high-fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72% agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module-level analysis yields several key insights, including a systematic evaluation of diverse techniques used in mobile GUI agents, optimal module configurations across model scales, the inherent limitations of current large foundation models (LFMs), and actionable guidelines for designing more capable and cost-efficient mobile agents.