Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms, with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviation in behavior from some reference, or default policy. In addition to empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust region, and variational approaches. However, there is limited formal understanding of desirable properties for default policies in the multitask setting, an increasingly important domain as the field shifts towards training more generally capable agents. Here, we take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we then derive a principled RPO algorithm for multitask learning with strong performance guarantees.
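As a rough sketch of the kind of objective this family of methods optimizes (the notation below is illustrative and not taken from the paper), a KL-regularized policy optimization objective with default policy $\pi_0$ and regularization weight $\alpha$ can be written as

$$
J_\alpha(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^t \Big( r(s_t, a_t) \;-\; \alpha\, \mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\pi_0(\cdot \mid s_t)\big) \Big)\right],
$$

where the agent maximizes cumulative reward while the KL term penalizes deviation of the learned policy $\pi$ from the default policy $\pi_0$.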