Building Graphical User Interface (GUI) agents is a promising research direction: such agents simulate human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across operating systems and applications, mainly due to the high cost of manual annotation. In this paper, we propose the TongUI framework, which builds generalized GUI agents by learning from rich multimodal web tutorials. Concretely, we crawl and process online GUI tutorials (such as videos and articles) into GUI agent trajectory data, producing the GUI-Net dataset of 143K trajectories spanning five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net. The resulting agents show remarkable improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by about 10\% on multiple benchmarks, which demonstrates the effectiveness of the GUI-Net dataset and underscores the significance of the TongUI framework. We will fully open-source the code, the GUI-Net dataset, and the trained models soon.