Combined with demonstrations, deep reinforcement learning can efficiently develop policies for robotic manipulators. In practice, however, collecting a sufficient number of high-quality demonstrations is time-consuming, and human demonstrations may be unsuitable for robots. The non-Markovian nature of the demonstration process and over-reliance on demonstrations pose further challenges. For example, we found that RL agents in manipulation tasks are sensitive to demonstration quality and struggle to adapt to demonstrations collected directly from humans. It is therefore challenging to leverage low-quality, insufficient demonstrations to help reinforcement learning train better policies; sometimes limited demonstrations even lead to worse performance. We propose a new algorithm, TD3fG (TD3 learning from a generator), to address these problems. It forms a smooth transition from learning from experts to learning from experience, which helps the agent extract prior knowledge from the demonstrations while reducing their detrimental effects. Our algorithm performs well on Adroit manipulation and MuJoCo tasks with limited demonstrations.
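The abstract does not spell out the transition mechanism, but one common way to realize a smooth hand-off from imitation to experience is an actor objective whose behavior-cloning weight decays over training. The following is a minimal PyTorch sketch under that assumption; the network shapes, the linear decay schedule, and the names `actor`, `generator`, and `critic` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical networks for illustration only (state dim 4, action dim 2).
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())
generator = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())
critic = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)

def actor_loss(states, step, total_steps):
    """Blend an imitation term (toward the generator's action) with the usual
    TD3 deterministic policy-gradient term; the imitation weight decays so
    learning shifts from the demonstration-trained generator to experience."""
    bc_weight = max(0.0, 1.0 - step / total_steps)   # assumed linear schedule
    actions = actor(states)
    with torch.no_grad():
        ref_actions = generator(states)              # generator pre-trained on demos
    bc_loss = ((actions - ref_actions) ** 2).mean()  # imitation term
    q_loss = -critic(torch.cat([states, actions], dim=-1)).mean()  # TD3 term
    return bc_weight * bc_loss + (1.0 - bc_weight) * q_loss

# Example: one actor update at training step 1000 of 10000.
states = torch.randn(32, 4)
loss = actor_loss(states, step=1000, total_steps=10_000)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Early in training the loss is dominated by the imitation term, so weak or scarce demonstrations still seed the policy; late in training the TD3 term dominates, so demonstration flaws no longer constrain the final policy.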