In recent years, a variety of tasks have been accomplished by deep reinforcement learning (DRL). However, when applying DRL to tasks in a real-world environment, designing an appropriate reward is difficult. Reward signals obtained from actual hardware sensors may contain noise, misinterpretations, or failed observations. The learning instability caused by these unstable signals remains an open problem in DRL. In this work, we propose an approach that extends existing DRL models by adding a subtask that directly estimates the variance contained in the reward signal. The model then passes the feature map learned through this subtask in the critic network to the actor network. This enables stable learning that is robust to the effects of potential noise. Experiments in the Atari game domain with unstable reward signals show that our method stabilizes training convergence. We also discuss the extensibility of the model by visualizing feature maps. This approach has the potential to make DRL more practical for use in noisy, real-world scenarios.
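To make the described architecture concrete, the following is a minimal sketch of an actor-critic network with an auxiliary variance-estimation head, assuming an Atari-style stacked-frame input. The class name, layer sizes, and the use of a softplus to keep the variance estimate non-negative are illustrative assumptions, not the paper's exact implementation; the sketch only shows how a critic-side subtask head and the actor head can share one feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch: an actor-critic model extended with a subtask head
# that estimates the variance contained in the reward signal. The actor
# reads the same feature map that the variance subtask shapes in the critic.
class VarianceAwareActorCritic(nn.Module):
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        # Shared convolutional encoder for 84x84 stacked Atari frames
        # (layer sizes are assumptions, following a common DQN-style encoder).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        # Critic head: state-value estimate.
        self.value_head = nn.Linear(512, 1)
        # Subtask head: estimate of the reward-signal variance
        # (trained against an auxiliary target chosen by the practitioner).
        self.variance_head = nn.Linear(512, 1)
        # Actor head: policy logits computed from the shared feature map.
        self.policy_head = nn.Linear(512, num_actions)

    def forward(self, obs):
        features = self.encoder(obs)
        value = self.value_head(features)
        # Softplus keeps the variance estimate non-negative.
        reward_variance = F.softplus(self.variance_head(features))
        policy_logits = self.policy_head(features)
        return policy_logits, value, reward_variance
```

In this sketch, the auxiliary loss on `reward_variance` would be added to the usual actor-critic objective so that the shared encoder learns features informative about reward noise, which is the mechanism the abstract attributes to the stabilized training.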