Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviation in behavior from some reference, or default policy. In addition to empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust region, and variational approaches. However, there is limited formal understanding of desirable properties for default policies in the multitask setting, an increasingly important domain as the field shifts towards training more generally capable agents. Here, we take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we then derive a principled RPO algorithm for multitask learning with strong performance guarantees.
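For readers unfamiliar with the setup, a common form of the KL-regularized objective underlying RPO methods is sketched below. The notation is standard but assumed here rather than quoted from the abstract: $\pi$ is the learned policy, $\pi_0$ the reference (default) policy, $\alpha$ a regularization weight, and $\gamma$ the discount factor.

$$
J_\alpha(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, a_t) \;-\; \alpha\,\mathrm{KL}\big(\pi(\cdot \mid s_t)\,\big\|\,\pi_0(\cdot \mid s_t)\big) \Big)\right]
$$

The first term rewards cumulative return, while the second penalizes behavioral deviation from the default policy, matching the trade-off described above.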