Dynamic penalty function approach for constraints handling in reinforcement learning

Reinforcement learning (RL) is attracting attention as an effective way to solve sequential optimization problems involving high-dimensional state/action spaces and stochastic uncertainties. Many such problems involve constraints expressed as inequalities. This study focuses on using RL to solve such constrained optimal control problems. Most RL application studies have treated inequality constraints as soft constraints by adding penalty terms for constraint violation to the reward function. However, when training neural networks to represent the value (or Q) function, a key step in RL, one can run into computational issues caused by the sharp change in the function value at the constraint boundary due to the large penalty imposed. This difficulty during training can lead to convergence problems and ultimately poor closed-loop performance. To address this problem, this study suggests the use of a dynamic penalty function, which gradually and systematically increases the penalty factor during training as the episodes proceed. First, we examined the ability of a neural network to represent an artificial value function when uniform, linear, or dynamic penalty functions are added to prevent constraint violation. Then, an agent trained by a Deep Q Network (DQN) algorithm with the dynamic penalty function approach was compared with agents trained with constant penalty functions in a simple vehicle control problem. Results show that the dynamic penalty approach can improve the neural network's approximation accuracy and yield faster convergence to a solution closer to the optimal one.
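As a rough illustration of the idea, the sketch below shows one way a penalty factor could grow over training episodes and enter a soft-constrained reward. The abstract does not give the actual schedule or penalty form, so everything here is an assumption: the ramp that is linear in the episode index, the constraint convention g(s, a) <= 0, the default factor values, and the function names `penalty_factor` and `penalized_reward` are all hypothetical.

```python
def penalty_factor(episode, total_episodes, initial_factor=1.0, final_factor=100.0):
    """Ramp the penalty factor linearly from initial_factor to final_factor.

    Assumption: a linear-in-episode ramp; the study's schedule may differ.
    """
    # Fraction of training completed, clamped to [0, 1].
    frac = episode / max(total_episodes - 1, 1)
    frac = min(max(frac, 0.0), 1.0)
    return initial_factor + frac * (final_factor - initial_factor)


def penalized_reward(reward, constraint_value, episode, total_episodes):
    """Soft-constraint reward for an inequality constraint g(s, a) <= 0.

    constraint_value is g(s, a); only positive values count as violations.
    """
    violation = max(0.0, constraint_value)
    return reward - penalty_factor(episode, total_episodes) * violation


if __name__ == "__main__":
    # The same violation is penalized lightly early on and heavily later.
    for ep in (0, 50, 100):
        print(ep, penalized_reward(1.0, 0.5, ep, total_episodes=101))
```

With a small factor early in training, the reward changes only mildly across the constraint boundary, which is the property the abstract argues makes the value function easier to approximate. The factor then grows so that violations become costly by the end of training.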
