With Reinforcement Learning (RL) for inventory management (IM) being a nascent field of research, approaches tend to be limited to simple, linear environments with implementations that are minor modifications of off-the-shelf RL algorithms. Scaling these simplistic environments to a real-world supply chain comes with several challenges, such as: minimizing the computational requirements of the environment, specifying agent configurations that are representative of dynamics at real-world stores and warehouses, and specifying a reward framework that encourages desirable behavior across the whole supply chain. In this work, we present a system with a custom GPU-parallelized environment consisting of one warehouse and multiple stores, a novel architecture for agent-environment dynamics incorporating enhanced state and action spaces, and a shared reward specification that seeks to optimize for a large retailer's supply chain needs. Each vertex in the supply chain graph is an independent agent that, based on its own inventory, is able to place replenishment orders with the vertex upstream. The warehouse agent, aside from placing orders with the supplier, has the special property of also being able to constrain replenishment to the stores downstream, which results in it learning an additional allocation sub-policy. We achieve a system that outperforms standard inventory control policies, such as a base-stock policy, as well as other RL-based specifications for a single product, and lay out a future direction of work for multiple products.
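To make the described agent-environment structure concrete, the following is a minimal sketch of one environment step under assumed dynamics: store agents request replenishment from the warehouse, the warehouse caps and ships those requests (its allocation sub-policy) while ordering from an external supplier, and a shared reward aggregates outcomes across the chain. All names, dimensions, demand distributions, and cost coefficients here are illustrative assumptions, not the paper's implementation or calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STORES = 3           # number of downstream store agents (hypothetical)
MEAN_DEMAND = 20       # mean per-step customer demand per store (hypothetical)

# Inventory state: one warehouse vertex and N_STORES store vertices.
warehouse_inv = 500.0
store_inv = np.full(N_STORES, 100.0)

def step(store_orders, warehouse_order, allocation_cap):
    """One environment step under the assumed dynamics.

    store_orders:    replenishment quantity each store requests from the warehouse
    warehouse_order: quantity the warehouse requests from the external supplier
    allocation_cap:  per-store cap the warehouse imposes on downstream shipments
                     (a stand-in for the learned allocation sub-policy)
    """
    global warehouse_inv, store_inv

    # Warehouse constrains replenishment to stores: ship the requested amount,
    # clipped by the cap and scaled down pro rata if warehouse stock is scarce.
    desired = np.minimum(np.asarray(store_orders, dtype=float), allocation_cap)
    total = desired.sum()
    shipped = desired if total <= warehouse_inv else desired * warehouse_inv / total
    warehouse_inv += warehouse_order - shipped.sum()
    store_inv += shipped

    # Stochastic customer demand at each store; unmet demand is lost.
    demand = rng.poisson(MEAN_DEMAND, size=N_STORES)
    sales = np.minimum(store_inv, demand)
    store_inv -= sales

    # Shared reward: revenue minus holding costs across the whole chain
    # (illustrative coefficients only).
    reward = 1.0 * sales.sum() - 0.05 * store_inv.sum() - 0.02 * warehouse_inv
    return reward

# Example usage with fixed actions; in the paper's setting these would come
# from the learned store and warehouse policies.
print(step(store_orders=[30, 25, 40], warehouse_order=100, allocation_cap=35))
```

In this sketch each vertex acts only on quantities it can observe locally (its own inventory and incoming orders), while the single scalar reward couples all agents, which is the sense in which the reward specification is "shared" across the supply chain.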