Exploiting Symmetry and Heuristic Demonstrations in Off-policy Reinforcement Learning for Robotic Manipulation

Reinforcement learning shows significant potential for automatically building control policies in numerous domains, but it is inefficient when applied to robot manipulation tasks because of the curse of dimensionality. For such tasks, prior knowledge or heuristics that embody inherent simplifications can effectively improve learning performance. This paper defines and incorporates the natural symmetry present in physical robotic environments. Sample-efficient policies are then trained by exploiting expert demonstrations in symmetrical environments through a combination of reinforcement learning and behavior cloning, which gives the off-policy learning process a diverse yet compact initialization. Furthermore, the paper presents a rigorous framework for a recent concept and explores its scope for robot manipulation tasks. The proposed method is validated in a simulation study on two point-to-point reaching tasks of an industrial arm, with and without an obstacle. Demonstrations are generated by a PID controller that tracks linear joint-space trajectories, with hard-coded temporal logic producing interim midpoints. The results show the effect of the number of demonstrations and quantify the magnitude of behavior cloning, illustrating the possible improvement of model-free reinforcement learning in common manipulation tasks. A comparison with a traditional off-policy reinforcement learning algorithm indicates the proposed method's advantage in learning performance and its potential value for applications.
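To make the idea of reusing demonstrations across symmetrical environments concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical planar two-link arm whose workspace is mirror-symmetric about the x-axis, so that negating the joint angles and joint-velocity actions reflects the end-effector goal from (x, y) to (x, -y). The link lengths, PID gains, kinematic joint model, and all function names (`forward_kinematics`, `pid_demonstration`, `mirror_transition`, `augment_with_symmetry`) are illustrative assumptions; the industrial arm, obstacle handling, and midpoint logic of the study are not modeled.

```python
import math

# Hypothetical planar 2-link arm; NOT the industrial arm used in the study.
LINK_LENGTHS = (1.0, 0.8)


def forward_kinematics(q):
    """Return the end-effector (x, y) for joint angles q = (q1, q2)."""
    l1, l2 = LINK_LENGTHS
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return (x, y)


def pid_demonstration(q_start, q_goal, steps=50, dt=0.05,
                      kp=8.0, ki=0.5, kd=0.1):
    """Track a linear joint-space trajectory with a per-joint PID controller.

    Assumes a purely kinematic model in which the action is a joint velocity.
    Returns a list of (joint_angles, action, goal_xy) transitions.
    """
    n = len(q_start)
    q = list(q_start)
    goal_xy = forward_kinematics(q_goal)
    integral = [0.0] * n
    prev_error = [0.0] * n
    transitions = []
    for t in range(1, steps + 1):
        action = []
        for i in range(n):
            # Reference point on the straight line from start to goal.
            q_ref = q_start[i] + (q_goal[i] - q_start[i]) * t / steps
            error = q_ref - q[i]
            integral[i] += error * dt
            derivative = (error - prev_error[i]) / dt
            prev_error[i] = error
            action.append(kp * error + ki * integral[i] + kd * derivative)
        transitions.append((tuple(q), tuple(action), goal_xy))
        # Integrate the commanded joint velocities over one time step.
        q = [q[i] + action[i] * dt for i in range(n)]
    return transitions


def mirror_transition(transition):
    """Reflect one transition about the x-axis (assumed symmetry plane)."""
    q, action, (gx, gy) = transition
    return (tuple(-v for v in q), tuple(-v for v in action), (gx, -gy))


def augment_with_symmetry(demos):
    """Return the demonstrations plus their mirrored copies."""
    return demos + [[mirror_transition(tr) for tr in demo] for demo in demos]
```

For example, `augment_with_symmetry([pid_demonstration((0.1, 0.2), (0.9, -0.4))])` yields two demonstrations, the second reaching the mirrored goal. Under these assumptions the mirrored copy matches what the same controller would produce if run directly in the mirrored environment, which is what would allow it to seed an off-policy replay buffer or a behavior-cloning loss without collecting new demonstrations.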
