In this paper, we propose a novel reinforcement learning algorithm for inventory management of newly launched products with no historical demand information. The algorithm follows the classic Dyna-$Q$ structure, balancing model-free and model-based approaches, while accelerating the training process of Dyna-$Q$ and mitigating the model discrepancy introduced by the model-based feedback. Drawing on transfer learning, warm-start information from the demand data of existing similar products can be incorporated into the algorithm to further stabilize early-stage training and reduce the variance of the estimated optimal policy. Our approach is validated through a case study of bakery inventory management with real data. The adjusted Dyna-$Q$ achieves up to a 23.7\% reduction in average daily cost compared with $Q$-learning, and up to a 77.5\% reduction in training time over the same horizon compared with classic Dyna-$Q$. With transfer learning, the adjusted Dyna-$Q$ attains the lowest total cost, the lowest variance in total cost, and relatively low shortage percentages among all benchmark algorithms over a 30-day test.
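For readers unfamiliar with the Dyna-$Q$ structure the abstract refers to, the following is a minimal tabular sketch of the classic loop: each real transition drives a direct $Q$-learning update, refreshes a learned one-step model, and triggers several simulated planning updates. The class name, the `warm_start` helper, and all hyperparameter values are illustrative assumptions for exposition, not the paper's implementation.

```python
import random
from collections import defaultdict

class DynaQ:
    """Tabular Dyna-Q sketch: model-free Q-learning updates plus
    model-based planning sweeps over a learned one-step model."""

    def __init__(self, actions, alpha=0.1, gamma=0.95,
                 epsilon=0.1, planning_steps=10):
        self.Q = defaultdict(float)   # (state, action) -> value estimate
        self.model = {}               # (state, action) -> (reward, next_state)
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.planning_steps = planning_steps

    def warm_start(self, q_source, weight=1.0):
        # Hypothetical transfer-learning warm start: seed Q with values
        # estimated from demand data of a similar existing product.
        for sa, v in q_source.items():
            self.Q[sa] = weight * v

    def act(self, state):
        # Epsilon-greedy action selection.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    def update(self, s, a, r, s_next):
        # 1) Direct (model-free) Q-learning update from real experience.
        best_next = max(self.Q[(s_next, b)] for b in self.actions)
        self.Q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.Q[(s, a)])
        # 2) Record the observed transition in the learned model.
        self.model[(s, a)] = (r, s_next)
        # 3) Model-based planning: replay simulated transitions.
        for _ in range(self.planning_steps):
            (ps, pa), (pr, ps_next) = random.choice(list(self.model.items()))
            best = max(self.Q[(ps_next, b)] for b in self.actions)
            self.Q[(ps, pa)] += self.alpha * (pr + self.gamma * best - self.Q[(ps, pa)])
```

The planning loop is what makes Dyna-$Q$ sample-efficient relative to plain $Q$-learning (each real transition yields `planning_steps` additional updates), and it is also the source of the model-discrepancy feedback that the adjusted algorithm is designed to mitigate.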