Agents should avoid unsafe behaviour during both training and deployment. This typically requires a simulator and a procedural specification of unsafe behaviour. Unfortunately, a simulator is not always available, and procedurally specifying constraints can be difficult or impossible for many real-world tasks. A recently introduced technique, ReQueST, aims to solve this problem by learning a neural simulator of the environment from safe human trajectories, then using the learned simulator to efficiently learn a reward model from human feedback. However, it is not yet known whether this approach is feasible in complex 3D environments with feedback obtained from real humans: whether sufficient pixel-based neural simulator quality can be achieved, and whether the human data requirements are viable in terms of both quantity and quality. In this paper we answer this question in the affirmative, using ReQueST to train an agent to perform a 3D first-person object-collection task using data entirely from human contractors. We show that the resulting agent exhibits an order-of-magnitude reduction in unsafe behaviour compared to standard reinforcement learning.