In reinforcement learning (RL), the ability to utilize prior knowledge from previously solved tasks can allow agents to quickly solve new problems. In some cases, these new problems may be approximately solved by composing the solutions of previously solved primitive tasks (task composition). Otherwise, prior knowledge can be used to adjust the reward function for a new problem, in a way that leaves the optimal policy unchanged but enables quicker learning (reward shaping). In this work, we develop a general framework for reward shaping and task composition in entropy-regularized RL. To do so, we derive an exact relation connecting the optimal soft value functions for two entropy-regularized RL problems with different reward functions and dynamics. We show how the derived relation leads to a general result for reward shaping in entropy-regularized RL. We then generalize this approach to derive an exact relation connecting optimal value functions for the composition of multiple tasks in entropy-regularized RL. We validate these theoretical contributions with experiments showing that reward shaping and task composition lead to faster learning in various settings.
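For context, the "optimal soft value functions" referenced above come from the standard entropy-regularized formulation; the sketch below is generic background rather than the paper's derived relation, and the notation (inverse temperature $\beta$, prior policy $\pi_0$, discount factor $\gamma$) is an assumption of this sketch:

$$
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r(s_t,a_t) - \tfrac{1}{\beta}\log\frac{\pi(a_t\mid s_t)}{\pi_0(a_t\mid s_t)}\right)\,\middle|\, s_0 = s\right],
$$

with the optimal soft value function satisfying the soft Bellman equation

$$
V^{*}(s) \;=\; \tfrac{1}{\beta}\log \sum_{a} \pi_0(a\mid s)\,\exp\!\Big(\beta\big(r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}\,V^{*}(s')\big)\Big).
$$

The contributions described in the abstract relate two such optimal soft value functions defined under different reward functions (and dynamics), which is what enables the exact reward-shaping and composition results.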