Neural Policy Iteration for Stochastic Optimal Control: A Physics-Informed Approach

We propose a physics-informed neural network policy iteration (PINN-PI) framework for solving stochastic optimal control problems governed by second-order Hamilton--Jacobi--Bellman (HJB) equations. At each iteration, a neural network is trained to approximate the value function by minimizing the residual of the linear PDE induced by a fixed policy. This linear structure enables systematic $L^2$ error control at each policy evaluation step and allows us to derive explicit Lipschitz-type bounds that quantify how errors in the value gradient propagate to the policy update. These bounds provide a theoretical basis for evaluating policy quality during training. Our method extends recent deterministic PINN-based approaches to the stochastic setting and inherits the global exponential convergence guarantees of classical policy iteration under mild conditions. We demonstrate its effectiveness on several benchmark problems, including stochastic cartpole and pendulum problems, and high-dimensional linear quadratic regulation (LQR) problems in up to 10 dimensions.
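
To make the evaluate-then-improve loop concrete, the following minimal sketch runs classical policy iteration on a scalar stochastic LQR problem. It is an illustration under stated assumptions, not the PINN-PI method itself: we assume a discounted infinite-horizon cost, dynamics $dx = (ax + bu)\,dt + \sigma\,dW$, and running cost $qx^2 + ru^2$, none of which are specified in the abstract. Under these assumptions the value function of a linear policy $u = -kx$ is $V(x) = px^2 + c$, so the linear PDE of the policy evaluation step can be solved in closed form instead of by training a network. All function names and constants are hypothetical.

```python
# Hypothetical scalar example (all names and constants are illustrative):
#   dynamics: dx = (a*x + b*u) dt + sigma dW
#   cost:     E int_0^inf exp(-rho*t) * (q*x^2 + r*u^2) dt


def evaluate_policy(k, a, b, q, r, sigma, rho):
    """Policy evaluation for the fixed linear policy u = -k*x.

    With the ansatz V(x) = p*x^2 + c, the linear PDE
        rho*V = (q + r*k^2)*x^2 + (a - b*k)*x*V' + 0.5*sigma^2*V''
    reduces to two linear equations for p and c.
    """
    denom = rho - 2.0 * (a - b * k)
    if denom <= 0.0:
        raise ValueError("gain k does not yield a finite discounted cost")
    p = (q + r * k * k) / denom   # match the x^2 terms
    c = sigma ** 2 * p / rho      # match the constant terms
    return p, c


def improve_policy(p, b, r):
    """Policy improvement: minimize r*u^2 + b*u*V'(x) over u, with V'(x) = 2*p*x.

    The minimizer is u = -(b*p/r)*x, so the new gain is b*p/r.
    """
    return b * p / r


def hjb_residual(p, a, b, q, r, rho):
    """Residual of the x^2 coefficient of the HJB equation at V = p*x^2 + c."""
    return q + 2.0 * a * p - (b * p) ** 2 / r - rho * p


def policy_iteration(k0, a, b, q, r, sigma, rho, n_iter=20):
    """Alternate evaluation and improvement, starting from the gain k0."""
    k = k0
    history = []
    for _ in range(n_iter):
        p, c = evaluate_policy(k, a, b, q, r, sigma, rho)
        history.append(p)
        k = improve_policy(p, b, r)
    return p, c, k, history


if __name__ == "__main__":
    params = dict(a=0.5, b=1.0, q=1.0, r=1.0, sigma=0.3, rho=0.1)
    p, c, k, history = policy_iteration(1.0, **params)
    print("p =", p, "c =", c, "k =", k)
    print("HJB residual:", hjb_residual(p, params["a"], params["b"],
                                        params["q"], params["r"], params["rho"]))
```

In the PINN-PI setting described above, `evaluate_policy` would presumably be replaced by a network trained on the residual of the same kind of linear PDE, and the policy update would use the gradient of that network; the closed-form step here only isolates the structure of the iteration.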
