Adversarial Environment Generation for Learning to Navigate the Web

Learning to autonomously navigate the web is a difficult sequential decision-making task. The state and action spaces are large and combinatorial, and websites are dynamic environments that span several pages. One bottleneck in training web navigation agents is providing a learnable curriculum of training environments that covers the large variety of real-world websites. We therefore propose using Adversarial Environment Generation (AEG) to generate challenging web environments in which to train reinforcement learning (RL) agents. We provide a new benchmarking environment, gMiniWoB, which enables an RL adversary to use compositional primitives to learn to generate arbitrarily complex websites. To train the adversary, we propose a new technique for maximizing regret using the difference in the scores obtained by a pair of navigator agents. Our results show that this approach significantly outperforms prior methods for minimax regret AEG. The regret objective trains the adversary to design a curriculum of environments that are "just-the-right-challenge" for the navigator agents; our results show that, over time, the adversary learns to generate increasingly complex web navigation tasks. The navigator agents trained with our technique learn to complete challenging, high-dimensional web navigation tasks, such as form filling and flight booking. We show that the navigator agent trained with our proposed Flexible b-PAIRED technique significantly outperforms competitive automatic curriculum generation baselines (including a state-of-the-art RL web navigation approach) on a set of challenging unseen test environments, and achieves a success rate above 80% on some tasks.
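
The regret signal described above can be illustrated with a minimal sketch. It assumes that each navigator's score is its mean episodic return over a few rollouts in one generated website, and that the regret estimate is the absolute gap between the two means, so that either navigator may be the stronger one. The function name `regret_estimate`, the averaging choice, and the sample scores are hypothetical illustrations, not details given in the abstract.

```python
from statistics import mean
from typing import Sequence


def regret_estimate(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Estimate regret as the gap between two navigators' mean scores.

    Assumption: a "score" is the mean episodic return of one navigator
    on a single generated environment.
    """
    if not scores_a or not scores_b:
        raise ValueError("each navigator needs at least one episode score")
    # Use the absolute gap so that either navigator may be the stronger one.
    return abs(mean(scores_a) - mean(scores_b))


# Hypothetical episode scores for two navigators on three generated websites.
easy_site = regret_estimate([1.0, 1.0, 1.0], [1.0, 1.0, 1.0])  # both succeed
hard_site = regret_estimate([0.0, 0.0], [0.0, 0.0])            # both fail
useful_site = regret_estimate([1.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0])

print(easy_site, hard_site, useful_site)
```

Under these assumptions, a website that both navigators solve and a website that neither solves both yield zero regret, while a website that only one navigator mostly solves yields a positive value. An adversary rewarded with this quantity would therefore be pushed toward environments that are solvable but not yet mastered, which matches the "just-the-right-challenge" curriculum the abstract describes.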
