Existing benchmarks for grounding language in interactive environments either lack real-world linguistic elements or prove difficult to scale up due to substantial human involvement in the collection of data or feedback signals. To bridge this gap, we develop WebShop -- a simulated e-commerce website environment with $1.18$ million real-world products and $12,087$ crowd-sourced text instructions. Given a text instruction specifying a product requirement, an agent needs to navigate multiple types of webpages and issue diverse actions to find, customize, and purchase an item. WebShop provides several challenges for language grounding, including understanding compositional instructions, query (re-)formulation, comprehending and acting on noisy text in webpages, and performing strategic exploration. We collect over $1,600$ human demonstrations for the task, and train and evaluate a diverse range of agents using reinforcement learning, imitation learning, and pre-trained image and language models. Our best model achieves a task success rate of $29\%$, which outperforms rule-based heuristics ($9.6\%$) but is far lower than human expert performance ($59\%$). We also analyze agent and human trajectories and ablate various model components to provide insights for developing future agents with stronger language understanding and decision-making abilities. Finally, we show that agents trained on WebShop exhibit non-trivial sim-to-real transfer when evaluated on amazon.com and ebay.com, indicating the potential value of WebShop in developing practical web-based agents that can operate in the wild.