The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes a new state of the art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro, and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and performing competitively with Gemini-3-Pro-based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and from increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.