We propose discriminative reward co-training (DIRECT) as an extension to deep reinforcement learning algorithms. Building upon the concept of self-imitation learning (SIL), we introduce an imitation buffer that stores beneficial trajectories generated by the policy, selected according to their return. A discriminator network is trained concurrently with the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies. The discriminator's verdict is used to construct a reward signal for optimizing the policy. By interpolating prior experience, DIRECT is able to act as a surrogate, steering policy optimization towards more valuable regions of the reward landscape and thus towards an optimal policy. Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments, as it is able to provide a surrogate reward to the policy and direct the optimization towards valuable areas.
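To make the described mechanism concrete, the following is a minimal sketch of the three ingredients named above: a return-ranked imitation buffer, a discriminator trained against on-policy samples, and a surrogate reward derived from the discriminator's verdict. It assumes a PyTorch setup with fixed-size state-action features; the names (`ImitationBuffer`, `Discriminator`, `discriminator_update`, `direct_reward`) and all hyperparameters are illustrative, not the paper's API.

```python
import heapq
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImitationBuffer:
    """Keeps the k highest-return trajectories generated by previous policies."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self._heap = []      # min-heap of (return, counter, trajectory)
        self._counter = 0    # tie-breaker so trajectories are never compared directly

    def add(self, trajectory, episode_return):
        item = (episode_return, self._counter, trajectory)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            heapq.heappushpop(self._heap, item)  # evict the lowest-return entry

    def sample(self, batch_size):
        trajs = [t for _, _, t in self._heap]
        idx = torch.randint(len(trajs), (batch_size,))
        return [trajs[i] for i in idx]


class Discriminator(nn.Module):
    """Scores state-action pairs: buffer-like (label 1) vs. current policy (label 0)."""

    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x)


def discriminator_update(discriminator, optimizer, buffer_batch, policy_batch):
    """One binary cross-entropy step distinguishing buffer data from fresh policy data."""
    logits_buf = discriminator(buffer_batch)
    logits_pol = discriminator(policy_batch)
    loss = (
        F.binary_cross_entropy_with_logits(logits_buf, torch.ones_like(logits_buf))
        + F.binary_cross_entropy_with_logits(logits_pol, torch.zeros_like(logits_pol))
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def direct_reward(discriminator, state_action):
    """Surrogate reward: log-probability that a sample resembles the beneficial buffer data."""
    with torch.no_grad():
        logits = discriminator(state_action)
        return F.logsigmoid(logits).squeeze(-1)
```

In this sketch the policy would be optimized with any standard on-policy algorithm, simply receiving `direct_reward` in place of (or in addition to) the environment reward, while `discriminator_update` is interleaved with the policy updates.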