MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes new state-of-the-art results across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro, and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and remaining competitive with Gemini-3-Pro-based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and from increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.
