Although deep reinforcement learning (RL) has been successfully applied to a variety of robotic control tasks, it remains challenging to apply it to real-world tasks due to its poor sample efficiency. To overcome this shortcoming, several works focus on reusing the collected trajectory data during training by decomposing trajectories into a set of policy-irrelevant discrete transitions. However, their improvements are somewhat marginal since i) the number of such transitions is usually small, and ii) value assignment only happens at the joint states. To address these issues, this paper introduces a concise yet powerful method to construct Continuous Transitions, which exploits trajectory information by leveraging the potential transitions along the trajectory. Specifically, we propose to synthesize new transitions for training by linearly interpolating between consecutive transitions. To keep the constructed transitions authentic, we also develop a discriminator that guides the construction process automatically. Extensive experiments demonstrate that our proposed method achieves a significant improvement in sample efficiency on various complex continuous robotic control problems in MuJoCo and outperforms advanced model-based and model-free RL methods. The source code is available.
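As a rough illustration of the interpolation step described above, the following minimal sketch synthesizes one new transition from two consecutive ones stored as NumPy arrays. The uniform sampling of the coefficient and the function names are assumptions for illustration only; in the proposed method the coefficient is guided automatically by the discriminator, which is not shown here.

```python
import numpy as np

def interpolate_transitions(t0, t1, rng=np.random.default_rng()):
    """Synthesize a continuous transition from two consecutive transitions.

    t0 = (s_t,     a_t,     r_t,     s_{t+1})
    t1 = (s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2})
    Each element is a NumPy array (states/actions) or a float (reward).
    """
    # Hypothetical choice: draw the interpolation coefficient uniformly;
    # the paper instead adapts it with a discriminator.
    lam = rng.uniform(0.0, 1.0)

    s0, a0, r0, ns0 = t0
    s1, a1, r1, ns1 = t1

    # Element-wise convex combination of every component of the two transitions.
    s_new  = lam * s0  + (1.0 - lam) * s1
    a_new  = lam * a0  + (1.0 - lam) * a1
    r_new  = lam * r0  + (1.0 - lam) * r1
    ns_new = lam * ns0 + (1.0 - lam) * ns1
    return s_new, a_new, r_new, ns_new
```

The synthesized tuple can then be added to the replay buffer alongside the original discrete transitions, increasing the effective amount of training data drawn from each trajectory.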