Reinforcement learning (RL) has led to considerable breakthroughs in diverse areas such as robotics, games, and many others. However, the application of RL to complex real-world decision-making problems remains limited. Many problems in operations management (inventory and revenue management, for example) are characterized by large action spaces and stochastic system dynamics. These characteristics make such problems considerably harder to solve for existing RL methods that rely on enumeration techniques to solve per-step action problems. To resolve these issues, we develop Programmable Actor Reinforcement Learning (PARL), a policy iteration method that uses techniques from integer programming and sample average approximation. Analytically, we show that, for a given critic, the learned policy in each iteration converges to the optimal policy as the number of underlying samples of the uncertainty goes to infinity. Practically, we show that a properly selected discretization of the underlying uncertainty distribution can yield a near-optimal actor policy even with very few samples from the underlying uncertainty. We then apply our algorithm to real-world inventory management problems with complex supply chain structures and show that PARL outperforms state-of-the-art RL and inventory optimization methods in these settings. We find that PARL outperforms the commonly used base-stock heuristic by 44.7% and the best-performing RL method by up to 12.1% on average across different supply chain environments.
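To make the sample-average-approximation (SAA) idea in the abstract concrete, the following is a minimal sketch, under assumed interfaces, of the per-step actor problem: the action is chosen to maximize the empirical average, over a small set of sampled (or discretized) realizations of the uncertainty, of the immediate reward plus the discounted critic value of the next state. The names `reward`, `transition`, `critic`, and `candidate_actions` are hypothetical placeholders, and the brute-force enumeration over a small candidate set is for illustration only; the paper's method instead solves this maximization with an integer program over the full action space.

```python
import numpy as np

def saa_action(state, candidate_actions, demand_samples, reward, transition, critic, gamma=0.99):
    """Sample-average approximation of the per-step actor problem.

    For each candidate action, estimate the expected one-step return
    (immediate reward plus discounted critic value of the next state)
    by averaging over the supplied demand scenarios, then return the
    action with the highest estimate.  Hypothetical interfaces:
      reward(state, action, demand)     -> float
      transition(state, action, demand) -> next_state
      critic(next_state)                -> estimated value (float)
    """
    best_action, best_value = None, -np.inf
    for a in candidate_actions:
        # Empirical (sample-average) estimate of Q(state, a) over the scenarios.
        q_hat = np.mean([
            reward(state, a, d) + gamma * critic(transition(state, a, d))
            for d in demand_samples
        ])
        if q_hat > best_value:
            best_action, best_value = a, q_hat
    return best_action
```

In the inventory settings described above, the candidate set would be the very large integer space of order quantities, which is why PARL replaces this enumeration with an integer-programming formulation of the same sample-average objective.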