Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators

Nonlinear control systems in which the decision maker has only partial information are prevalent in a variety of applications. As a step toward studying such systems, this work explores reinforcement learning methods for finding the optimal policy in nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components and is governed by a policy with the same structure. Assuming that the nonlinear component comprises kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. Although the cost function is nonconvex in general, we establish local strong convexity and smoothness in the vicinity of the global optimizer. Additionally, we propose an initialization mechanism to leverage these properties. Building on these developments, we design a policy gradient algorithm that is guaranteed to converge to the globally optimal policy at a linear rate.
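To make the setting concrete, the following is a minimal sketch of gradient descent on the parameters of a policy that mixes a linear and a nonlinear term. It is an illustration under stated assumptions, not the algorithm analyzed in the paper. The dynamics form `x' = A x + B u + C phi(x)`, the policy form `u = -K1 x - K2 phi(x)`, the kernel `phi = tanh`, all matrices, the finite horizon, the finite-difference gradient, and the backtracking step size are hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical problem data, chosen only for illustration (not from the paper).
A = np.array([[0.9, 0.2], [0.0, 0.8]])   # linear state transition
B = np.array([[0.0], [1.0]])             # control input matrix
C = 0.05 * np.eye(2)                     # small weight on the nonlinear term
Q = np.eye(2)                            # state cost
R = np.eye(1)                            # control cost
X0S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]])


def phi(x):
    """Assumed nonlinear kernel; tanh is 1-Lipschitz, and C scales it down."""
    return np.tanh(x)


def cost(theta, horizon=40):
    """Average finite-horizon quadratic cost over the fixed initial states."""
    K1 = theta[:2].reshape(1, 2)   # gain on the linear feature x
    K2 = theta[2:].reshape(1, 2)   # gain on the nonlinear feature phi(x)
    total = 0.0
    for x0 in X0S:
        x = x0.copy()
        for _ in range(horizon):
            u = -K1 @ x - K2 @ phi(x)
            total += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u + C @ phi(x)
    return total / len(X0S)


def grad(theta, eps=1e-5):
    """Central finite-difference estimate of the cost gradient."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (cost(theta + e) - cost(theta - e)) / (2 * eps)
    return g


def policy_gradient_descent(theta0, iters=20, step0=0.1):
    """Gradient descent with Armijo backtracking; returns parameters and costs."""
    theta = theta0.astype(float)
    history = [cost(theta)]
    for _ in range(iters):
        g = grad(theta)
        step = step0
        while step > 1e-8:
            candidate = theta - step * g
            c = cost(candidate)
            # Accept only steps that give a sufficient decrease in cost.
            if c <= history[-1] - 1e-4 * step * (g @ g):
                theta = candidate
                history.append(c)
                break
            step *= 0.5
        else:
            break  # no acceptable step found; stop early
    return theta, history


theta, history = policy_gradient_descent(np.zeros(4))
print(f"initial cost {history[0]:.4f}, final cost {history[-1]:.4f}")
```

The zero initialization works here only because the hypothetical `A` is already stable; the initialization mechanism proposed in the paper is not reproduced. The backtracking rule is a convenience that keeps the recorded cost non-increasing in this sketch and says nothing about the linear rate established in the paper.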
