We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful for tuning across controllers, possibly learnt in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first algorithm, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach, we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control-theoretic benchmark of stabilizing a cartpole and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.
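To make the improper-mixture idea concrete, the following is a minimal sketch (not the paper's algorithm verbatim) of a softmax mixture over $M$ fixed base controllers whose mixture logits are updated by a REINFORCE-style policy gradient; the names `env`, `base_controllers`, and `run_episode` are illustrative assumptions.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def run_episode(env, base_controllers, theta, gamma=0.99):
    """Roll out one episode: at each step sample a base controller
    k ~ softmax(theta), act with it, and accumulate the discounted return
    together with the score term sum_t grad log pi_theta(k_t)."""
    probs = softmax(theta)
    state = env.reset()
    ret, disc = 0.0, 1.0
    score = np.zeros_like(theta)
    done = False
    while not done:
        k = np.random.choice(len(base_controllers), p=probs)
        action = base_controllers[k](state)      # act with the sampled base controller
        state, reward, done = env.step(action)
        ret += disc * reward
        disc *= gamma
        grad_log = -probs                        # d/dtheta_j log softmax(theta)[k] = 1{j=k} - p_j
        grad_log[k] += 1.0
        score += grad_log
    return ret, score

def improper_policy_gradient(env, base_controllers, iters=500, lr=0.05):
    """Learn improper mixture weights over the M base controllers."""
    theta = np.zeros(len(base_controllers))      # mixture logits
    for _ in range(iters):
        ret, score = run_episode(env, base_controllers, theta)
        theta += lr * ret * score                # vanilla policy-gradient ascent step
    return softmax(theta)                        # learned mixture weights
```

Even when every individual base controller is unstable, the learned mixture weights can, in principle, place mass across controllers so that the resulting improper policy stabilizes the system, which is the behavior reported in the cartpole and queueing experiments.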