Reward shaping (RS) is a powerful method in reinforcement learning (RL) for overcoming the problem of sparse or uninformative rewards. However, RS typically relies on manually engineered shaping-reward functions whose construction is time-consuming and error-prone. It also requires domain knowledge, which runs contrary to the goal of autonomous learning. We introduce Reinforcement Learning Optimising Shaping Algorithm (ROSA), an automated reward-shaping framework in which the shaping-reward function is constructed in a Markov game between two agents. A reward-shaping agent (Shaper) uses switching controls to determine the states at which to add shaping rewards for more efficient learning, while the other agent (Controller) learns the optimal policy for the task using these shaped rewards. We prove that ROSA, which adopts existing RL algorithms, learns to construct a shaping-reward function that is beneficial to the task, thus ensuring efficient convergence to high-performance policies. We demonstrate ROSA's properties in three didactic experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse-reward environments.
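To make the division of labour between the two agents concrete, the following is a minimal sketch, not ROSA's implementation. It assumes a toy five-state chain with a sparse goal reward, and it replaces the learned Shaper with a hypothetical `ToyShaper` whose per-state switch and bonus value are set by hand; in ROSA both would be learned. The names `ToyShaper`, `sparse_env_reward`, and `controller_rewards` are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyShaper:
    """Hypothetical stand-in for the Shaper: a hand-set on/off switch per state."""
    n_states: int
    bonus: float = 0.25  # assumed constant shaping value; learned in ROSA
    switch: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Default: the switch is off everywhere, so no shaping reward is added.
        if not self.switch:
            self.switch = [False] * self.n_states

    def shaping_reward(self, state: int) -> float:
        # A shaping reward is added only at states where the switch is on.
        return self.bonus if self.switch[state] else 0.0


def sparse_env_reward(state: int, goal: int) -> float:
    """Sparse task reward: nonzero only at the goal state."""
    return 1.0 if state == goal else 0.0


def controller_rewards(trajectory: List[int], shaper: ToyShaper, goal: int) -> List[float]:
    """Rewards the Controller would train on: task reward plus shaping reward."""
    return [sparse_env_reward(s, goal) + shaper.shaping_reward(s) for s in trajectory]


shaper = ToyShaper(n_states=5)
shaper.switch[2] = True  # switch on shaping at two intermediate states
shaper.switch[3] = True
rewards = controller_rewards([0, 1, 2, 3, 4], shaper, goal=4)
```

With the switch off everywhere, the Controller sees only the sparse task reward; turning it on at selected states gives intermediate signal along the path to the goal, which is the role the abstract assigns to the Shaper.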