External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution. Our analysis shows that even with a knowledge correctness rate of 90%, the execution success rate is only 41%. To bridge this gap, we propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful, objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state-of-the-art Agent S2. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue, the high behavioral standard deviation of computer use agents, thereby substantially improving agent reliability.
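The two-stage flow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Step` schema, the `retrace` and `critique` function names, and the rule-based logic are hypothetical stand-ins, not the UI-Evol implementation. In particular, the filtering and comparison rules here are toy substitutes for whatever the actual Retrace and Critique stages do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One recorded agent-environment interaction (hypothetical schema)."""
    action: str           # e.g. "click"
    target: str           # e.g. "File menu"
    changed_screen: bool  # whether the observation changed after the action


def retrace(trajectory: List[Step]) -> List[str]:
    """Retrace Stage (toy version): keep only actions with an observable effect."""
    return [f"{s.action} {s.target}" for s in trajectory if s.changed_screen]


def critique(knowledge: List[str], retraced: List[str], reference: List[str]) -> List[str]:
    """Critique Stage (toy version): compare the retraced sequence against an
    external reference and append a note for every reference step the agent missed."""
    missed = [step for step in reference if step not in retraced]
    return knowledge + [f"Do not skip: {step}" for step in missed]


# Hypothetical example data, not taken from OSWorld.
trajectory = [
    Step("click", "File menu", True),
    Step("click", "Export", False),  # no visible effect, so retrace drops it
    Step("click", "Save As", True),
]
reference = ["click File menu", "click Export", "choose PDF format"]
knowledge = ["Open the File menu to export a document."]

retraced = retrace(trajectory)
refined = critique(knowledge, retraced, reference)
```

The point of the sketch is the separation of concerns: the retraced sequence is derived only from what the agent actually did, and knowledge is revised only after that sequence is compared with an external reference.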