Modern policy optimization methods in applied reinforcement learning, such as Trust Region Policy Optimization and Policy Mirror Descent, are often based on the policy gradient framework. While theoretical guarantees have been established for this class of algorithms, particularly in the tabular setting, the use of general parametrization schemes remains largely unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parametrizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of the mirror map. For a general mirror map and parametrization class, we establish the quasi-monotonicity of the updates in terms of the value function, derive global linear convergence rates, and bound the total expected Bregman divergence of the algorithm along its path. To showcase the ability of our framework to accommodate general parametrization schemes, we present a case study involving shallow neural networks.
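For orientation, a standard tabular policy mirror descent update with mirror map $h$, Bregman divergence $D_h$, and step size $\eta$ takes the form below; this is the classical special case that the abstract refers to as being extended to general parametrizations, and the notation ($Q^{\pi_k}$, $\Delta(\mathcal{A})$) is illustrative rather than taken from the paper itself.
\[
\pi_{k+1}(\cdot \mid s) \;=\; \operatorname*{arg\,max}_{p \,\in\, \Delta(\mathcal{A})} \Big\{ \eta \,\big\langle Q^{\pi_k}(s,\cdot),\, p \big\rangle \;-\; D_h\big(p,\ \pi_k(\cdot \mid s)\big) \Big\},
\qquad
D_h(p, q) \;=\; h(p) - h(q) - \langle \nabla h(q),\, p - q \rangle .
\]
When $h$ is the negative entropy, $D_h$ is the Kullback-Leibler divergence and the update reduces to the multiplicative-weights rule $\pi_{k+1}(a \mid s) \propto \pi_k(a \mid s)\,\exp\!\big(\eta\, Q^{\pi_k}(s,a)\big)$, which is how the softmax policy class arises as a special case of the mirror-map choice.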