Robotic skills can be learned via imitation learning (IL) using user-provided demonstrations, or via reinforcement learning (RL) using large amounts of autonomously collected experience. Both methods have complementary strengths and weaknesses: RL can reach a high level of performance, but requires exploration, which can be very time consuming and unsafe; IL does not require exploration, but only learns skills that are as good as the provided demonstrations. Can a single method combine the strengths of both approaches? A number of prior methods have aimed to address this question, proposing a variety of techniques that integrate elements of IL and RL. However, scaling up such methods to complex robotic skills that integrate diverse offline data and generalize meaningfully to real-world scenarios still presents a major challenge. In this paper, our aim is to test the scalability of prior IL + RL algorithms and devise a system based on detailed empirical experimentation that combines existing components in the most effective and scalable way. To that end, we present a series of experiments aimed at understanding the implications of each design decision, so as to develop a combined approach that can utilize demonstrations and heterogeneous prior data to attain the best performance on a range of real-world and realistic simulated robotic problems. Our complete method, which we call AW-Opt, combines elements of advantage-weighted regression [1, 2] and QT-Opt [3], providing a unified approach for integrating demonstrations and offline data for robotic manipulation. Please see https://awopt.github.io for more details.
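As background on the components named above, standard advantage-weighted regression [1, 2] updates the policy by supervised learning on logged data, weighted by exponentiated advantages. The following is a minimal sketch of that prior formulation, not necessarily the exact objective used in AW-Opt:

\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{(s,a)\sim\mathcal{D}} \big[ \log \pi(a \mid s) \, \exp\!\big( A^{\pi_k}(s,a) / \beta \big) \big],

where A^{\pi_k}(s,a) is the estimated advantage of action a in state s and \beta > 0 is a temperature. QT-Opt [3], by contrast, trains a Q-function from offline and on-policy data and selects continuous actions by maximizing the learned Q-function with the cross-entropy method.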