Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) than AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide a snapshot-based container environment and precise functional verification, including backend database inspection and task-callback APIs. We further develop a planner-executor agentic framework with an extended action space to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving success rates of only 51.7% and 20.9%, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.