We propose a novel reinforcement learning framework that performs self-supervised online reward shaping, yielding faster, more sample-efficient learning in sparse-reward environments. The proposed framework alternates between updating a policy and inferring a reward function. While the policy update is done with the inferred, potentially dense reward function, the original sparse reward provides a self-supervisory signal for the reward update by inducing an ordering over the observed trajectories. The proposed framework is based on the theory that altering the reward function does not affect the optimal policy of the original MDP as long as certain relations between the altered and the original reward are maintained. We name the proposed framework \textit{ClAssification-based REward Shaping} (CaReS), since we learn the altered reward in a self-supervised manner using classifier-based reward inference. Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is not only significantly more sample efficient than the state-of-the-art baseline, but also achieves a sample efficiency similar to that of MDPs that use hand-designed dense reward functions.
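To make the alternation described above concrete, the following is a minimal sketch (not the authors' implementation) of the two interleaved updates: the policy step is driven only by the learned dense reward, while the original sparse return is used solely to order trajectories and supervise a classifier-based reward model. All helper names (\texttt{collect\_trajectory}, \texttt{policy\_update}, the toy environment, and the logistic-regression reward model) are hypothetical placeholders chosen for illustration.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM = 4

def sparse_return(traj):
    # The environment's original sparse signal: 1 only if the goal was reached.
    return float(traj["reached_goal"])

def dense_reward(w, states):
    # Classifier score used as a dense, per-state shaping reward.
    return states @ w

def reward_update(w, trajs, lr=0.1):
    # Self-supervised step: the sparse return orders trajectories; train a
    # logistic classifier to score states from successful trajectories above
    # states from unsuccessful ones (a stand-in for the paper's classifier).
    X = np.concatenate([t["states"] for t in trajs])
    y = np.concatenate([np.full(len(t["states"]), sparse_return(t))
                        for t in trajs])
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (p - y) / len(y)
    return w - lr * grad

def collect_trajectory(policy_w, horizon=20):
    # Toy stand-in for environment interaction under the current policy.
    states = rng.normal(size=(horizon, STATE_DIM))
    reached = rng.random() < 0.1  # rare sparse success event
    return {"states": states, "reached_goal": reached}

def policy_update(policy_w, trajs, reward_w, lr=0.01):
    # Crude perturbation-based placeholder for a real policy-gradient step;
    # the point is that the *dense* reward, not the sparse one, drives it.
    for t in trajs:
        ret = dense_reward(reward_w, t["states"]).sum()
        policy_w = policy_w + lr * ret * rng.normal(size=policy_w.shape)
    return policy_w

reward_w = np.zeros(STATE_DIM)
policy_w = np.zeros(STATE_DIM)
for it in range(50):
    batch = [collect_trajectory(policy_w) for _ in range(8)]
    reward_w = reward_update(reward_w, batch)            # supervised by sparse ordering
    policy_w = policy_update(policy_w, batch, reward_w)  # trained on dense reward
\end{verbatim}
In practice the reward model and policy would be neural networks trained with a standard RL algorithm, but the control flow (collect, re-fit the classifier from the sparse ordering, update the policy on the inferred reward) is the part the sketch is meant to illustrate.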