We study how to endow GUI agents with scalable memory that helps them generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as the encoder; these embeddings are plugged directly into the backbone's input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow the memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about \$4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2\% of parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
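To make the continuous-memory mechanism concrete, below is a minimal sketch (not the authors' code) of the idea described above: a past trajectory's features, produced by the frozen VLM, are compressed by a small Q-Former into a fixed number of query embeddings, and those embeddings are prepended to the backbone's input-layer embeddings. All names and hyperparameters here (QFormerMemoryEncoder, num_query_tokens, hidden_dim, etc.) are illustrative assumptions, not the paper's actual identifiers or settings.

```python
import torch
import torch.nn as nn

class QFormerMemoryEncoder(nn.Module):
    """Compress a variable-length trajectory into fixed-length memory embeddings."""
    def __init__(self, hidden_dim: int = 3584, num_query_tokens: int = 32, num_layers: int = 2):
        super().__init__()
        # Learnable queries attend to trajectory features via cross-attention
        # inside a small transformer-decoder stack (a Q-Former-style module).
        self.queries = nn.Parameter(torch.randn(num_query_tokens, hidden_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, traj_features: torch.Tensor) -> torch.Tensor:
        # traj_features: (batch, seq_len, hidden_dim) -- VLM hidden states computed
        # over one past GUI trajectory (screenshots + actions).
        batch = traj_features.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Output: (batch, num_query_tokens, hidden_dim), a fixed-length memory.
        return self.decoder(q, traj_features)

def build_inputs_with_memory(memory_embeds: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend retrieved memory embeddings to the current prompt's input embeddings."""
    # memory_embeds: (batch, k * num_query_tokens, hidden_dim) for k retrieved trajectories.
    # prompt_embeds: (batch, prompt_len, hidden_dim) from the backbone's embedding layer.
    return torch.cat([memory_embeds, prompt_embeds], dim=1)
```

Under this reading, only the Q-Former (wrapped with LoRA adapters) would be trained, while the VLM encoder and backbone stay frozen; retrieval depth then simply controls how many fixed-length memory blocks are concatenated before the prompt.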