This work provides a Deep Reinforcement Learning approach to a periodic review inventory control problem with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this dynamic program has historically been considered intractable, our results show that several policy learning approaches are competitive with or outperform classical methods. To train these algorithms, we develop novel techniques for converting historical data into a simulator. On the theoretical side, we present learnability results for a subclass of inventory control problems, providing a provable reduction of the reinforcement learning problem to one of supervised learning. On the algorithmic side, we present a model-based reinforcement learning procedure (Direct Backprop) that solves the periodic review inventory control problem by constructing a differentiable simulator. Under a variety of metrics, Direct Backprop outperforms model-free RL and newsvendor baselines in both simulations and real-world deployments.
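To make the differentiable-simulator idea concrete, the following is a minimal sketch, not the paper's actual Direct Backprop implementation: a single-item, lost-sales system with zero lead time, a base-stock policy parameterized by an order-up-to level `S`, and illustrative holding/penalty costs and demand distribution (all assumptions not taken from the abstract). Because the simulator dynamics are piecewise linear, the cost's derivative with respect to `S` can be propagated through the rollout alongside the state, and `S` can then be optimized by plain gradient descent.

```python
import random

def simulate_grad(S, demands, h=1.0, p=4.0):
    """Roll out lost-sales inventory dynamics under a base-stock policy
    with order-up-to level S, accumulating total cost and its derivative
    dcost/dS by forward-mode differentiation of the piecewise-linear
    simulator (h = holding cost, p = lost-sales penalty per unit)."""
    x, dx = 0.0, 0.0          # on-hand inventory and its sensitivity to S
    cost, dcost = 0.0, 0.0
    for d in demands:
        # Order up to S (zero lead time in this sketch).
        q = max(S - x, 0.0)
        dq = (1.0 - dx) if S - x > 0 else 0.0
        y, dy = x + q, dx + dq            # inventory after replenishment
        lost = max(d - y, 0.0)            # unmet demand is lost
        dlost = -dy if d - y > 0 else 0.0
        x = max(y - d, 0.0)               # leftover carried to next period
        dx = dy if y - d > 0 else 0.0
        cost += h * x + p * lost
        dcost += h * dx + p * dlost
    return cost, dcost

random.seed(0)
demands = [random.uniform(0.0, 10.0) for _ in range(200)]

S = 1.0                                   # initial order-up-to level
lr = 0.05 / len(demands)                  # step size, scaled by horizon
for _ in range(300):                      # gradient descent through the rollout
    cost, grad = simulate_grad(S, demands)
    S -= lr * grad
```

With these costs the optimal level sits near the 0.8 critical fractile of the demand distribution; the descent drives `S` toward that level without ever estimating a demand model separately from the policy, which is the appeal of backpropagating through the simulator directly.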