Although deep reinforcement learning~(RL) has been successfully applied to a variety of robotic control tasks, it remains challenging to apply to real-world tasks due to its poor sample efficiency. To overcome this shortcoming, several works focus on reusing the collected trajectory data during training by decomposing it into a set of policy-irrelevant discrete transitions. However, their improvements are somewhat marginal since i) the number of such transitions is usually small, and ii) value assignment only happens at the joint states. To address these issues, this paper introduces a concise yet powerful method to construct \textit{Continuous Transitions}, which exploits trajectory information by leveraging the potential transitions along the trajectory. Specifically, we propose to synthesize new transitions for training by linearly interpolating consecutive transitions. To keep the constructed transitions authentic, we also develop a discriminator that guides the construction process automatically. Extensive experiments demonstrate that our proposed method achieves a significant improvement in sample efficiency on various complex continuous robotic control problems in MuJoCo and outperforms advanced model-based and model-free RL methods.
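As a brief illustrative sketch (the precise formulation is given in the method section; the interpolation coefficient $\lambda$ and its range are assumptions here), a continuous transition between two consecutive transitions $(s_t, a_t, r_t, s_{t+1})$ and $(s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2})$ can be synthesized by linear interpolation as
\begin{equation*}
\tilde{s} = \lambda s_t + (1-\lambda)\, s_{t+1}, \quad
\tilde{a} = \lambda a_t + (1-\lambda)\, a_{t+1}, \quad
\tilde{r} = \lambda r_t + (1-\lambda)\, r_{t+1}, \quad
\tilde{s}' = \lambda s_{t+1} + (1-\lambda)\, s_{t+2},
\end{equation*}
with $\lambda \in [0,1]$, so that $\lambda = 1$ and $\lambda = 0$ recover the two original discrete transitions.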