Deep Reinforcement Learning (DRL) has made tremendous advances in both simulated and real-world robot control tasks in recent years. Nevertheless, applying DRL to novel robot control tasks remains challenging, especially when researchers must design the action space, the observation space, and the reward function. In this paper, we investigate partial observability as a potential failure source when applying DRL to robot control tasks, which can occur when researchers are not confident that the observation space fully represents the underlying state. We compare the performance of three common DRL algorithms, TD3, SAC, and PPO, under various partial observability conditions. We find that TD3 and SAC easily become stuck in local optima and underperform PPO. We propose multi-step versions of vanilla TD3 and SAC that replace one-step bootstrapping with multi-step bootstrapping to improve robustness to partial observability.
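To make the multi-step bootstrapping idea concrete, below is a minimal sketch of an n-step TD target of the kind such variants substitute for the one-step target in TD3 and SAC. The function name, arguments, and the use of a single bootstrapped critic value are illustrative assumptions, not the paper's exact formulation.

```python
def n_step_td_target(rewards, next_value, dones, gamma=0.99):
    """Compute an n-step bootstrapped TD target (illustrative sketch).

    rewards:    list of n consecutive rewards r_t, ..., r_{t+n-1}
    next_value: critic estimate at s_{t+n}, e.g. the minimum of the twin
                target Q-networks evaluated at the target policy's action
    dones:      list of n done flags; bootstrapping stops at termination
    """
    target = next_value
    # Accumulate backwards: G_k = r_k + gamma * (1 - done_k) * G_{k+1}
    for r, d in zip(reversed(rewards), reversed(dones)):
        target = r + gamma * (1.0 - d) * target
    return target


# With n = 1 this reduces to the standard one-step target
# r_t + gamma * (1 - d_t) * Q(s_{t+1}, pi(s_{t+1})) of vanilla TD3/SAC:
one_step = n_step_td_target([1.0], next_value=5.0, dones=[0.0])
```

Under partial observability, a longer reward horizon in the target can reduce the critic's reliance on bootstrapped values computed from incomplete observations, which is the intuition behind such multi-step variants.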