Learning continuous control in high-dimensional, sparse-reward settings, such as robotic manipulation, is challenging because of the large number of samples often required to obtain accurate optimal value and policy estimates. While many deep reinforcement learning methods aim to improve sample efficiency through replay or improved exploration techniques, state-of-the-art actor-critic and policy-gradient methods still suffer from the hard-exploration problem in sparse-reward settings. Motivated by recent successes of value-based methods for approximating state-action values, such as RBF-DQN, we explore the potential of value-based reinforcement learning for learning continuous robotic manipulation tasks in multi-task, sparse-reward settings. On robotic manipulation tasks, we empirically show that RBF-DQN converges faster than current state-of-the-art algorithms such as TD3, SAC, and PPO. We also perform ablation studies with RBF-DQN and show that enhancement techniques for vanilla deep Q-learning, such as Hindsight Experience Replay (HER) and Prioritized Experience Replay (PER), can also be applied to RBF-DQN. Our experimental analysis suggests that value-based approaches may be more sensitive to data augmentation and replay buffer sampling techniques than policy-gradient methods, and that the benefits of these techniques for robot manipulation depend heavily on the transition dynamics of generated subgoal states.
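As context for the value-based approach referenced above, the sketch below illustrates one way an RBF-style state-action value head can be structured in the spirit of RBF-DQN: the network predicts a set of action-space centroids with associated values, and Q(s, a) is a distance-based softmax combination of those values. The class name, layer sizes, temperature beta, and centroid count here are illustrative assumptions, not necessarily the exact architecture used in RBF-DQN or in this work.

```python
# A minimal, hypothetical sketch of an RBF-style Q-value head (assumed
# structure; not the paper's exact implementation).
import torch
import torch.nn as nn

class RBFQNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, num_centroids=32, beta=1.0):
        super().__init__()
        self.beta = beta
        self.num_centroids = num_centroids
        self.action_dim = action_dim
        self.trunk = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU())
        # Centroid locations in action space and one scalar value per centroid.
        self.centroids = nn.Linear(256, num_centroids * action_dim)
        self.values = nn.Linear(256, num_centroids)

    def forward(self, state, action):
        h = self.trunk(state)
        c = self.centroids(h).view(-1, self.num_centroids, self.action_dim)
        v = self.values(h)                                   # (B, K)
        # Negative scaled distances act as logits for a softmax over centroids.
        logits = -self.beta * torch.norm(action.unsqueeze(1) - c, dim=-1)
        w = torch.softmax(logits, dim=-1)                    # (B, K)
        return (w * v).sum(dim=-1)                           # Q(s, a), shape (B,)

    def greedy_action(self, state):
        # Approximate argmax_a Q(s, a) by evaluating Q at each predicted centroid.
        h = self.trunk(state)
        c = self.centroids(h).view(-1, self.num_centroids, self.action_dim)
        q_at_centroids = torch.stack(
            [self.forward(state, c[:, k]) for k in range(self.num_centroids)],
            dim=-1)
        best = q_at_centroids.argmax(dim=-1)
        return c[torch.arange(c.shape[0]), best]
```

One design point worth noting in this sketch is that the greedy action is approximated by evaluating Q only at the predicted centroids, which keeps action selection a finite maximization even though the action space is continuous.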