We consider the problem of reinforcement learning when provided with (1) a baseline control policy and (2) a set of constraints that the learner must satisfy. The baseline policy can arise from demonstration data or a teacher agent and may provide useful cues for learning, but it might also be sub-optimal for the task at hand and is not guaranteed to satisfy the specified constraints, which might encode safety, fairness, or other application-specific requirements. In order to safely learn from baseline policies, we propose an iterative policy optimization algorithm that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint-satisfying set. We analyze our algorithm theoretically and provide a finite-time convergence guarantee. In our experiments on five different control tasks, our algorithm consistently outperforms several state-of-the-art baselines, achieving 10 times fewer constraint violations and 40% higher reward on average.
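The abstract describes a three-way alternation: a step that improves expected return, a step that pulls the policy back toward the baseline, and a projection onto the constraint-satisfying set. The following is a minimal sketch of that loop in parameter space, under assumptions not stated in the abstract: toy quadratic objectives, Euclidean distance to the baseline, and a Euclidean projection onto a single linear constraint. It illustrates the control flow only, not the paper's actual update rules or distance metric.

```python
# Sketch of the alternation: (1) improve estimated return, (2) reduce distance
# to the baseline policy, (3) project onto the constraint-satisfying set.
# All objectives, step sizes, and the projection below are illustrative
# assumptions, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

dim = 4                                  # toy policy-parameter dimension
theta_baseline = rng.normal(size=dim)    # stand-in for the baseline policy's parameters
theta = theta_baseline.copy()            # learner starts from the baseline

def estimated_return_grad(theta):
    """Toy stand-in for a policy-gradient estimate of the task return."""
    goal = np.ones(dim)                  # pretend the task optimum is at all-ones
    return goal - theta                  # gradient of -0.5 * ||theta - goal||^2

def project_onto_constraints(theta, a, b):
    """Euclidean projection onto the half-space {theta : a . theta <= b},
    a simple stand-in for projecting onto the constraint-satisfying set."""
    violation = a @ theta - b
    if violation <= 0:
        return theta
    return theta - violation * a / (a @ a)

# Toy safety constraint: a . theta <= b
a = np.array([1.0, 1.0, 0.0, 0.0])
b = 1.5

lr_return, lr_baseline = 0.1, 0.05
for t in range(200):
    # Step 1: gradient step to increase (estimated) expected return.
    theta = theta + lr_return * estimated_return_grad(theta)
    # Step 2: gradient step to reduce distance to the baseline policy
    # (Euclidean here; a KL divergence would be the natural choice for
    # stochastic policies).
    theta = theta - lr_baseline * (theta - theta_baseline)
    # Step 3: project onto the constraint-satisfying set.
    theta = project_onto_constraints(theta, a, b)

print("final parameters:", np.round(theta, 3))
print("constraint value a.theta =", round(float(a @ theta), 3), "<=", b)
```

Running the sketch shows the projected iterates settling at a point that trades off the toy return objective against proximity to the baseline while keeping the linear constraint satisfied at every iteration.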