The application of reinforcement learning (RL) to robotic control remains limited in environments with sparse and delayed rewards. In this paper, we propose a practical self-imitation learning method named Self-Imitation Learning with Constant Reward (SILCR). Instead of requiring hand-crafted immediate rewards from the environment, our method assigns each timestep a constant immediate reward derived from the final episodic reward. In this way, even when dense environment rewards are unavailable, every action taken by the agent is properly guided. We demonstrate the effectiveness of our method on several challenging continuous robotic control tasks in the MuJoCo simulator, and the results show that it significantly outperforms alternative methods in tasks with sparse and delayed rewards. Even compared with alternatives that have access to dense rewards, our method achieves competitive performance. Ablation experiments further demonstrate the stability and reproducibility of our method.
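The core idea of the abstract — replacing missing dense rewards with a constant per-step reward derived from the episodic return — can be sketched as follows. The exact assignment rule below (episodic return divided by episode length) and the function name are illustrative assumptions, not necessarily the paper's precise formulation:

```python
def relabel_with_constant_reward(transitions, episodic_return):
    """Relabel one finished episode's transitions with a constant reward.

    transitions: list of (state, action) pairs collected during the episode.
    episodic_return: the sparse/delayed reward observed at episode end.
    Returns (state, action, reward) triples where every step carries the
    same constant immediate reward.
    """
    T = len(transitions)
    r = episodic_return / T  # constant immediate reward for every timestep
    return [(s, a, r) for (s, a) in transitions]


# Example: a 4-step episode that earned a final episodic reward of 8.0
episode = [("s0", "a0"), ("s1", "a1"), ("s2", "a2"), ("s3", "a3")]
relabeled = relabel_with_constant_reward(episode, 8.0)
# every transition now carries the constant reward 2.0
```

With this relabeling, any standard off-policy RL algorithm can train on the relabeled transitions as if dense rewards were available.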