Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline without rewards, by supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality without reward annotations. This unlabeled data can be generated by a variety of sources such as human teleoperation, scripted policies and other agents on the same robot. Towards data-driven offline robot learning that can use this unlabeled experience, we introduce Offline Reinforced Imitation Learning (ORIL). ORIL first learns a reward function by contrasting observations from demonstrator and unlabeled trajectories, then annotates all data with the learned reward, and finally trains an agent via offline reinforcement learning. Across a diverse set of continuous control and simulated robotic manipulation tasks, we show that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.
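To make the pipeline concrete, below is a minimal sketch (not the authors' code) of the reward-learning and relabeling steps described above: a binary classifier is trained to contrast demonstrator observations with unlabeled ones, and its sigmoid output is then used as a learned reward to annotate all trajectories before offline RL. The class and function names (RewardModel, train_reward, relabel) and the plain binary cross-entropy objective are illustrative assumptions, written here in PyTorch.

```python
# Minimal sketch of ORIL's reward-learning stage (illustrative, not the paper's code):
# contrast demonstrator observations with unlabeled ones, then relabel all data.
import torch
import torch.nn as nn

class RewardModel(nn.Module):  # hypothetical name
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs)  # logit; sigmoid gives a reward in [0, 1]

def train_reward(model, demo_obs, unlabeled_obs, steps=1000, lr=1e-4, batch=256):
    """Train a classifier to distinguish demonstrator from unlabeled observations."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        d = demo_obs[torch.randint(len(demo_obs), (batch,))]
        u = unlabeled_obs[torch.randint(len(unlabeled_obs), (batch,))]
        logits = model(torch.cat([d, u]))
        labels = torch.cat([torch.ones(batch, 1), torch.zeros(batch, 1)])
        loss = bce(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

@torch.no_grad()
def relabel(model, obs):
    """Annotate observations with the learned reward for downstream offline RL."""
    return torch.sigmoid(model(obs)).squeeze(-1)
```

With rewards assigned by relabel, the combined demonstrator and unlabeled data can be passed to any offline RL algorithm in place of ground-truth reward annotations.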