This paper introduces the Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO) algorithm for reinforcement learning. The Hamilton-Jacobi-Bellman (HJB) equation is used in control theory to evaluate the optimality of the value function. Our work combines the HJB equation with reinforcement learning in continuous state and action spaces to improve the training of the value network. We treat the value network as a Physics-Informed Neural Network (PINN) that solves the HJB equation by computing its derivatives with respect to its inputs exactly. The Proximal Policy Optimization (PPO)-Clipped algorithm is augmented with this implementation, as it uses a value network to compute the objective function for its policy network. The HJBPPO algorithm shows improved performance compared to PPO on the MuJoCo environments.
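As a rough illustration of the PINN treatment described above, the sketch below shows how an HJB-style residual could be penalized in the value-network loss using exact automatic differentiation. It assumes a generic continuous-time form rho * V(s) = r(s, a) + dV/ds . f(s, a) with hypothetical, differentiable dynamics `f`, reward `r`, and discount rate `rho`; the paper's exact HJB formulation and loss weighting may differ.

```python
import torch

def hjb_residual_loss(value_net, policy, dynamics, reward, states, rho=0.01):
    """Sketch of a PINN-style HJB penalty for the value network.

    Hypothetical placeholders: `dynamics(s, a)` and `reward(s, a)` are assumed
    to be known, differentiable callables; `rho` is an assumed discount rate.
    The point is only that dV/ds is obtained exactly via autograd.
    """
    states = states.requires_grad_(True)
    values = value_net(states)                                  # V(s), shape [B, 1]
    grad_v = torch.autograd.grad(values.sum(), states,
                                 create_graph=True)[0]          # exact dV/ds
    actions = policy(states)                                    # candidate maximizing action
    drift = (grad_v * dynamics(states, actions)).sum(dim=-1)    # <dV/ds, f(s, a)>
    residual = rho * values.squeeze(-1) - reward(states, actions) - drift
    return (residual ** 2).mean()                               # added to the critic loss
```

In this sketch the residual would be minimized alongside the usual PPO value loss, so the critic is encouraged to satisfy the HJB optimality condition in addition to fitting returns.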