Credit assignment in meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on these insights, we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm enables efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample efficiency, wall-clock time, and asymptotic performance.
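To make the notion of "controlling the statistical distance" concrete, the following is a minimal sketch of how such a constraint can enter a meta-objective; the PPO-style clipped ratio and all symbols here ($\theta$, $\theta'_i$, $\alpha$, $\epsilon$, $\hat{A}_i$) are illustrative assumptions, not the paper's stated formulation. Each task $i$ adapts the pre-adaptation parameters via an inner gradient step, and the outer objective penalizes adapted policies that drift too far from the policy that collected the data:
\[
\theta'_i = \theta + \alpha \,\nabla_\theta J^{\mathrm{pre}}_i(\theta), \qquad
J(\theta) = \mathbb{E}_i\,\mathbb{E}_{\tau \sim \pi_{\theta'_i}}\!\left[
\min\!\Big( r_{\theta'_i}(\tau)\,\hat{A}_i(\tau),\;
\operatorname{clip}\!\big(r_{\theta'_i}(\tau),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i(\tau)\Big)\right],
\]
where $r_{\theta'_i}(\tau) = \pi_{\theta'_i}(\tau)/\pi_{\theta'_{i,\mathrm{old}}}(\tau)$ is the likelihood ratio against the adapted policy used for sampling. An analogous clipped ratio or KL penalty on the pre-adaptation policy $\pi_\theta$ would bound its change as well, which is the sense in which both levels of the policy are kept within a trust region during meta-policy search.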