We have developed a framework for reliably building agents capable of UI navigation. The state space is simplified from raw pixels to a set of UI elements extracted by screen understanding modules such as OCR and icon detection. The action space is restricted to these UI elements plus a few global actions. Actions can be customized per task, and each action is a sequence of basic operations conditioned on status checks. With this design, we are able to train DQfD and BC agents from a small number of demonstration episodes. We propose demo augmentation, which significantly reduces the number of human demonstrations required. We customized DQfD to accept demonstrations collected on screenshots, improving demo coverage of rare cases. Demos are collected only for the cases on which the previous version of the agent failed during evaluation. With tens of iterations of this evaluation, demo collection, and training loop, the agent reaches a 98.7\% success rate on the search task in an environment of 80+ apps and websites with randomized initial states and viewing parameters.
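To make the state/action design concrete, the following is a minimal Python sketch under our own assumptions; the names (\texttt{UIElement}, \texttt{Screen}, \texttt{Action}, \texttt{build\_action\_space}) are hypothetical and not part of the framework's actual API. The state is the list of detected UI elements, and the action space pairs one tap action per element with a few global actions; each action runs its operation sequence only if its status check passes.

\begin{verbatim}
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

@dataclass
class UIElement:
    # One element produced by screen understanding (OCR text or icon label).
    label: str
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

@dataclass
class Screen:
    # Minimal stand-in for the device screen; the real framework drives a device.
    elements: List[UIElement] = field(default_factory=list)

    def tap(self, bbox: Tuple[int, int, int, int]) -> None:
        x, y = (bbox[0] + bbox[2]) // 2, (bbox[1] + bbox[3]) // 2
        print(f"tap at ({x}, {y})")

@dataclass
class Action:
    # A task-customizable action: basic operations gated by a status check.
    name: str
    status_check: Callable[[Screen], bool]
    operations: Sequence[Callable[[Screen], None]]

    def execute(self, screen: Screen) -> bool:
        if not self.status_check(screen):
            return False  # skip the whole sequence if the status check fails
        for op in self.operations:
            op(screen)
        return True

def build_action_space(screen: Screen,
                       global_actions: List[Action]) -> List[Action]:
    # Action space = one tap action per detected UI element + global actions.
    element_actions = [
        Action(name=f"tap:{e.label}",
               status_check=lambda s: True,
               operations=[lambda s, e=e: s.tap(e.bbox)])
        for e in screen.elements
    ]
    return element_actions + global_actions

screen = Screen(elements=[UIElement("Search", (10, 10, 90, 40))])
back = Action("global:back", lambda s: True, [lambda s: print("press back")])
for action in build_action_space(screen, [back]):
    action.execute(screen)
\end{verbatim}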
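The evaluation, demo-collection, and training loop can likewise be sketched in a few lines. This is a toy illustration under assumed interfaces (\texttt{iterative\_training} and the callables passed to it are hypothetical), showing the key point from the text: new demonstrations are collected only for the failure cases of the previous agent, and the loop repeats for tens of iterations.

\begin{verbatim}
from typing import Callable, Dict, List

Agent = Callable[[Dict], bool]  # returns True iff the agent succeeds on a case

def iterative_training(train: Callable[[List[Dict]], Agent],
                       eval_cases: List[Dict],
                       collect_demo: Callable[[Dict], Dict],
                       num_iterations: int = 30) -> Agent:
    demos: List[Dict] = []
    agent = train(demos)  # initial agent (e.g., BC/DQfD on seed demos)
    for _ in range(num_iterations):
        # Evaluate the current agent and keep only its failure cases.
        failures = [case for case in eval_cases if not agent(case)]
        if not failures:
            break  # success rate saturated on the evaluation set
        # Collect human demos only for the failed cases, then retrain.
        demos += [collect_demo(case) for case in failures]
        agent = train(demos)
    return agent

# Toy usage: the "agent" memorizes demonstrated cases, so the failure
# set shrinks with each iteration until evaluation fully succeeds.
cases = [{"id": i} for i in range(5)]
def toy_train(demos: List[Dict]) -> Agent:
    seen = {d["id"] for d in demos}
    return lambda case: case["id"] in seen
final_agent = iterative_training(toy_train, cases, collect_demo=lambda c: c)
\end{verbatim}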