This work studies reinforcement learning (RL) in the context of multi-period supply chains subject to constraints, e.g., on production and inventory. We introduce Distributional Constrained Policy Optimization (DCPO), a novel approach for reliable constraint satisfaction in RL. Our approach builds on Constrained Policy Optimization (CPO), which in practice suffers from approximation errors that lead it to converge to infeasible policies. We address this issue by incorporating aspects of distributional RL into DCPO. Specifically, we represent the return and cost value functions using neural networks that output discrete distributions, and we reshape costs based on the associated confidence. Using a supply chain case study, we show that DCPO improves the rate at which the RL policy converges and ensures reliable constraint satisfaction by the end of training. The proposed method also improves predictability, greatly reducing the variance of returns between runs; this result is significant in the context of policy gradient methods, which intrinsically introduce significant variance during training.
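As a rough illustration of the discrete-distribution value representation mentioned above, the sketch below shows a categorical value head in PyTorch that outputs a probability distribution over a fixed support of return (or cost) values, from which a mean estimate and a spread-based confidence signal can be read off. The atom count, support range, and hidden-layer sizes are illustrative assumptions, not the architecture used in this work.

```python
# Minimal sketch of a categorical (discrete-distribution) value head,
# assuming a C51-style fixed support; hyperparameters are placeholders.
import torch
import torch.nn as nn


class CategoricalValueHead(nn.Module):
    """Maps a state to a discrete distribution over value atoms."""

    def __init__(self, obs_dim: int, n_atoms: int = 51,
                 v_min: float = -10.0, v_max: float = 10.0):
        super().__init__()
        # Fixed support of value atoms (not learned).
        self.register_buffer("support", torch.linspace(v_min, v_max, n_atoms))
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, n_atoms),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Probabilities over the fixed support (one distribution per state).
        return torch.softmax(self.net(obs), dim=-1)

    def mean_and_std(self, obs: torch.Tensor):
        # Mean and standard deviation of the predicted value distribution;
        # the spread can serve as a confidence signal when reshaping costs.
        probs = self.forward(obs)
        mean = (probs * self.support).sum(-1)
        var = (probs * (self.support - mean.unsqueeze(-1)) ** 2).sum(-1)
        return mean, var.sqrt()


# Usage example (hypothetical observation dimension):
head = CategoricalValueHead(obs_dim=8)
mu, sigma = head.mean_and_std(torch.randn(4, 8))
```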