Action-constrained reinforcement learning (RL) is a widely used approach in various real-world applications, such as scheduling in networked systems with resource constraints and control of robots with kinematic constraints. While the existing projection-based approaches ensure zero constraint violation, they can suffer from the zero-gradient problem due to the tight coupling of the policy gradient and the projection, which results in sample-inefficient training and slow convergence. To tackle this issue, we propose a learning algorithm that decouples the action constraints from the policy parameter update by leveraging state-wise Frank-Wolfe and a regression-based policy update scheme. Moreover, we show that the proposed algorithm enjoys convergence and policy improvement properties in the tabular case, and that it generalizes the popular DDPG algorithm to action-constrained RL in the general case. Through experiments, we demonstrate that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
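The following is a minimal sketch, not the authors' reference implementation, of the two decoupled steps the abstract describes: a state-wise Frank-Wolfe update that improves the action inside the constraint set without any projection, followed by a regression-based policy update that fits the policy to the improved actions. It assumes a simple box action constraint and a differentiable critic; the helper names (`frank_wolfe_action`, `regression_policy_loss`) are illustrative placeholders, not names from the paper.

```python
import numpy as np

def lmo_box(grad, a_lo, a_hi):
    """Linear maximization oracle over a box: argmax_{v in [a_lo, a_hi]} <grad, v>."""
    return np.where(grad >= 0.0, a_hi, a_lo)

def frank_wolfe_action(q_grad, a_init, a_lo, a_hi, num_steps=20):
    """State-wise Frank-Wolfe: improve the action for one state while staying
    feasible, so no projection (and hence no zero-gradient issue) is needed.
    q_grad(a) returns dQ(s, a)/da for the fixed state s."""
    a = np.asarray(a_init, dtype=float)
    for k in range(num_steps):
        g = q_grad(a)                  # ascent direction of the critic
        v = lmo_box(g, a_lo, a_hi)     # best feasible point for the linearized objective
        gamma = 2.0 / (k + 2.0)        # standard Frank-Wolfe step size
        a = a + gamma * (v - a)        # convex combination stays inside the box
    return a

def regression_policy_loss(policy_actions, fw_actions):
    """Regression-based policy update: fit the policy's outputs to the
    Frank-Wolfe-improved actions, decoupling constraints from the gradient."""
    return np.mean(np.sum((policy_actions - fw_actions) ** 2, axis=-1))

# Toy usage with a concave quadratic critic Q(s, a) = -||a - a_star||^2,
# where the unconstrained maximizer a_star lies outside the box [-1, 1]^2.
a_star = np.array([1.5, -0.3])
q_grad = lambda a: -2.0 * (a - a_star)
a_fw = frank_wolfe_action(q_grad, a_init=np.zeros(2),
                          a_lo=-np.ones(2), a_hi=np.ones(2))
print(a_fw)  # approaches the feasible maximizer [1.0, -0.3]
```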