Modern policy optimization methods in applied reinforcement learning are often inspired by the trust region policy optimization algorithm, which can be interpreted as a particular instance of policy mirror descent. While theoretical guarantees have been established for this framework, particularly in the tabular setting, the use of general parametrization schemes remains largely unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parametrizations. The policy class induced by our scheme recovers known classes, e.g., tabular softmax, log-linear, and neural policies, and it generates new ones depending on the choice of the mirror map. For a general mirror map and parametrization function, we establish the quasi-monotonicity of the updates in terms of the value function and global linear convergence rates, and we bound the total variation of the algorithm along its path. To showcase the ability of our framework to accommodate general parametrization schemes, we present a case study involving shallow neural networks.
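For context, a standard formulation of the tabular policy mirror descent update is sketched below; this is the classical scheme the abstract refers to, not the parametrized update introduced in this work, whose exact form is given in the main text. Here $h$ denotes a mirror map on the simplex $\Delta(\mathcal{A})$, $D_h$ the Bregman divergence it induces, $Q^{\pi_k}$ the state-action value function of the current policy, and $\eta_k > 0$ a step size:
\[
\pi_{k+1}(\cdot \mid s) \in \operatorname*{arg\,max}_{p \in \Delta(\mathcal{A})} \Big\{ \eta_k \big\langle Q^{\pi_k}(s,\cdot),\, p \big\rangle - D_h\big(p,\ \pi_k(\cdot \mid s)\big) \Big\}, \qquad \text{for all } s \in \mathcal{S}.
\]
Choosing $h$ as the negative entropy yields the multiplicative softmax update, while the Euclidean mirror map yields projected updates; the framework summarized in this abstract extends this idea to general parametrizations and mirror maps, though its precise update rule is not reproduced here.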