Direct policy search is one of the workhorses of modern reinforcement learning (RL), and its application to continuous control tasks has recently attracted increasing attention. In this work, we investigate the convergence theory of policy gradient (PG) methods for learning linear risk-sensitive and robust controllers. In particular, we develop PG methods that can be implemented in a derivative-free fashion by sampling system trajectories, and we establish both global convergence and sample complexity results for two fundamental settings in risk-sensitive and robust control: the finite-horizon linear exponential quadratic Gaussian problem and the finite-horizon linear-quadratic disturbance attenuation problem. As a by-product, our results also provide the first sample complexity guarantee for the global convergence of PG methods on zero-sum linear-quadratic dynamic games, a nonconvex-nonconcave minimax optimization problem that serves as a baseline setting in multi-agent reinforcement learning (MARL) with continuous spaces. One feature of our algorithms is that a certain level of robustness/risk-sensitivity of the controller is preserved throughout the learning phase, a property we term implicit regularization, which is an essential requirement in safety-critical control systems.
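To make the "derivative-free PG by sampling system trajectories" idea concrete, the sketch below shows a standard zeroth-order (smoothing-based) gradient estimator for a linear state-feedback policy on a simple finite-horizon linear-quadratic rollout. It is only an illustration under assumed problem data: the system matrices A, B, the costs Q, R, the helpers rollout_cost and zeroth_order_grad, and the parameters radius, num_samples, and the step size are all hypothetical and do not reproduce the paper's exact LEQG or disturbance-attenuation algorithms.

import numpy as np

# Hypothetical problem data (illustration only): a small discrete-time
# linear system x_{t+1} = A x_t + B u_t + w_t with quadratic stage cost.
rng = np.random.default_rng(0)
n, m, horizon = 3, 2, 20
A = np.eye(n) + 0.05 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Q, R = np.eye(n), np.eye(m)

def rollout_cost(K, noise_std=0.1):
    """Simulate one trajectory under the linear policy u_t = -K x_t
    and return its accumulated quadratic cost."""
    x = rng.standard_normal(n)
    cost = 0.0
    for _ in range(horizon):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + noise_std * rng.standard_normal(n)
    return cost

def zeroth_order_grad(K, radius=0.05, num_samples=200):
    """Smoothing-based gradient estimate: perturb the policy gains on a
    sphere of radius `radius` and average cost-weighted perturbations.
    This is the generic zeroth-order estimator, not the paper's
    risk-sensitive/robust update."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(num_samples):
        U = rng.standard_normal(K.shape)
        U *= radius / np.linalg.norm(U)       # uniform direction, norm = radius
        grad += rollout_cost(K + U) * U       # cost-weighted perturbation
    return (d / (radius**2 * num_samples)) * grad

# Usage illustration: plain gradient descent on the policy gains.
K = np.zeros((m, n))
for _ in range(50):
    K -= 1e-4 * zeroth_order_grad(K)

In the paper's settings, the same trajectory-sampling idea is applied to risk-sensitive and robust objectives rather than the plain quadratic cost above, and the updates are designed so that the iterates retain a prescribed level of robustness/risk-sensitivity (the implicit regularization property).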