The temporal Credit Assignment Problem (CAP) is a well-known and challenging task in AI. While Reinforcement Learning (RL), especially Deep RL, often works well when immediate rewards are available, it can fail when only delayed rewards are provided or when the reward function is noisy. In this work, we propose delegating the CAP to a Neural Network-based algorithm named InferNet that explicitly learns to infer the immediate rewards from the delayed rewards. The effectiveness of InferNet was evaluated on two online RL tasks, a simple GridWorld and 40 Atari games, and two offline RL tasks, GridWorld and a real-life Sepsis treatment task. For all tasks, the effectiveness of using the InferNet-inferred rewards is compared against that of using the immediate and the delayed rewards, under two settings: with and without reward noise. Overall, our results show that InferNet is robust against noisy reward functions and is an effective add-on mechanism for solving the temporal CAP in a wide range of RL tasks, from classic RL simulation environments to a real-world RL problem, and for both online and offline learning.
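To make the core idea concrete, the sketch below shows one plausible minimal realization of a network that "learns to infer the immediate rewards from the delayed rewards": a per-step reward predictor trained so that its predictions over an episode sum to the observed delayed (episode-end) reward. The architecture, optimizer settings, and episode format here are illustrative assumptions, not the authors' exact InferNet implementation.

```python
# A minimal sketch (assumed, not the authors' code): infer per-step rewards
# whose episode sum matches the delayed reward observed at episode end.
import torch
import torch.nn as nn

class RewardInferenceNet(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # states: (T, state_dim), actions: (T, action_dim) one-hot encodings.
        # Returns (T,) inferred immediate rewards for one episode.
        return self.net(torch.cat([states, actions], dim=-1)).squeeze(-1)

def train_step(model, optimizer, states, actions, delayed_reward):
    """One update: push the inferred per-step rewards to sum to the delayed reward."""
    optimizer.zero_grad()
    inferred = model(states, actions)              # (T,)
    loss = (inferred.sum() - delayed_reward) ** 2  # episode-sum constraint
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage on one synthetic episode of length T = 10.
T, state_dim, action_dim = 10, 4, 2
model = RewardInferenceNet(state_dim, action_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
states = torch.randn(T, state_dim)
actions = torch.eye(action_dim)[torch.randint(action_dim, (T,))]
delayed_reward = torch.tensor(1.0)  # observed only at the end of the episode
for _ in range(100):
    train_step(model, opt, states, actions, delayed_reward)
# The inferred per-step rewards can then stand in for the delayed reward
# when training a downstream RL agent, online or offline.
```

Under this reading, InferNet acts as an add-on reward-redistribution stage: any standard RL algorithm can consume the inferred rewards unchanged.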