Prior access to domain knowledge could significantly improve the performance of a reinforcement learning agent. In particular, it could help agents avoid potentially catastrophic exploratory actions that would otherwise have to be experienced during learning. In this work, we identify consistently undesirable actions across a set of previously learned tasks and use pseudo-rewards associated with them to learn a prior policy. In addition to enabling safer exploratory behavior in subsequent tasks in the domain, we show that these priors are transferable to similar environments and can be learned off-policy, in parallel with the learning of other tasks in the domain. We compare our approach to established, state-of-the-art algorithms in both discrete and continuous environments, and demonstrate that it exhibits safer exploratory behavior while learning to perform arbitrary tasks in the domain. We also present a theoretical analysis to support these results, and briefly discuss the implications of this approach as well as some alternative formulations that could be useful in certain scenarios.
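To make the idea concrete, the following is a minimal illustrative sketch, not the paper's exact algorithm: actions that were consistently undesirable across previously learned tasks receive a negative pseudo-reward, a prior Q-function is learned off-policy from those pseudo-rewards, and exploration in a new task is restricted to actions the prior does not flag as unsafe. All names and numbers (the counts of undesirable outcomes, the pseudo-reward value, the safety threshold) are assumptions for illustration only.

```python
# Minimal sketch (assumed details, not the paper's exact algorithm):
# learn a "safety prior" Q-function from pseudo-rewards attached to
# consistently undesirable actions, then use it to restrict exploration.
import numpy as np

N_STATES, N_ACTIONS = 10, 4
rng = np.random.default_rng(0)

# 1) Identify actions that were consistently undesirable across previously
#    learned tasks (here: simulated counts of bad outcomes per (state, action)).
undesirable_counts = rng.integers(0, 5, size=(N_STATES, N_ACTIONS))
consistently_bad = undesirable_counts >= 4          # flagged as bad in most tasks

# 2) Attach pseudo-rewards (negative for undesirable actions, zero otherwise)
#    and learn a prior Q off-policy with simple tabular Q-learning.
PSEUDO_REWARD = -1.0
gamma, alpha = 0.9, 0.5
Q_prior = np.zeros((N_STATES, N_ACTIONS))
for _ in range(5000):
    s = rng.integers(N_STATES)
    a = rng.integers(N_ACTIONS)                      # uniform behavior policy (off-policy)
    r = PSEUDO_REWARD if consistently_bad[s, a] else 0.0
    s_next = rng.integers(N_STATES)                  # stand-in for real environment dynamics
    Q_prior[s, a] += alpha * (r + gamma * Q_prior[s_next].max() - Q_prior[s, a])

# 3) Safer exploration in a subsequent task: sample only among actions whose
#    prior value exceeds a safety threshold (fall back to all actions if none do).
def safe_explore(state, threshold=-0.5):
    allowed = np.flatnonzero(Q_prior[state] > threshold)
    if allowed.size == 0:
        allowed = np.arange(N_ACTIONS)
    return rng.choice(allowed)

print("exploratory action chosen for state 3:", safe_explore(3))
```

Because the pseudo-rewards depend only on the flagged actions and not on any particular task's reward, such a prior can in principle be trained from logged experience in parallel with other learning, which is the property the abstract refers to as off-policy, parallel learning of the prior.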