We study the global linear convergence of policy gradient (PG) methods for finite-horizon exploratory linear-quadratic control (LQC) problems. The setting includes stochastic LQC problems with indefinite costs and allows additional entropy regularisers in the objective. We consider a continuous-time Gaussian policy whose mean is linear in the state variable and whose covariance is state-independent. In contrast to discrete-time problems, the cost is noncoercive in the policy and not all descent directions lead to bounded iterates. We propose geometry-aware gradient descents for the mean and covariance of the policy using the Fisher geometry and the Bures-Wasserstein geometry, respectively. The policy iterates are shown to satisfy an a priori bound and to converge globally to the optimal policy at a linear rate. We further propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis and achieves robust linear convergence across different action frequencies. A numerical experiment confirms the convergence and robustness of the proposed algorithm.