The convergence of policy gradient algorithms in reinforcement learning hinges on the optimization landscape of the underlying optimal control problem. Theoretical insights into these algorithms can often be obtained by analyzing the landscape of linear quadratic control. However, most of the existing literature considers only the optimization landscape of static full-state or static output-feedback policies (controllers). We investigate the more challenging case of dynamic output-feedback policies for linear quadratic regulation (abbreviated as dLQR), which is prevalent in practice but has a considerably more complicated optimization landscape. We first show how the dLQR cost varies with coordinate transformations of the dynamic controller, and then derive the optimal transformation for a given observable stabilizing controller. At the core of our results is the uniqueness of the stationary point of dLQR when the controller is observable: the stationary point takes the concise form of an observer-based controller with the optimal similarity transformation. These results shed light on designing efficient algorithms for general decision-making problems with partially observed information.
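For context, the following is a minimal sketch of what a coordinate transformation of a dynamic controller means, assuming the standard state-space parameterization of a dynamic output-feedback policy; the symbols $(A_K, B_K, C_K, \xi_t, T)$ are illustrative and not taken from the paper. The controller maintains an internal state $\xi_t$, driven by the plant output $y_t$, and produces the control input $u_t$:
\[
\xi_{t+1} = A_K \xi_t + B_K y_t, \qquad u_t = C_K \xi_t .
\]
A coordinate (similarity) transformation replaces the internal state by $\tilde{\xi}_t = T \xi_t$ for an invertible matrix $T$, which maps the controller parameters as
\[
(A_K,\, B_K,\, C_K) \;\longmapsto\; \bigl(T A_K T^{-1},\; T B_K,\; C_K T^{-1}\bigr).
\]
This leaves the closed-loop input-output behavior unchanged, but, as stated above, the dLQR cost itself varies with $T$, and the optimal $T$ is derived for a given observable stabilizing controller.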