Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer-system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano achieves state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, with significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.