We propose a formulation of the stochastic cutting stock problem as a discounted infinite-horizon Markov decision process. At each decision epoch, given the current inventory of items, an agent chooses which cutting patterns to apply to the objects in stock in anticipation of the unknown demand. An optimal solution corresponds to a policy that associates each state with a decision and minimizes the expected total cost. Since exact algorithms scale exponentially with the state-space dimension, we develop a heuristic solution approach based on reinforcement learning. We propose an approximate policy iteration algorithm in which a linear model approximates the action-value function of a policy. Policy evaluation is performed by solving the projected Bellman equation from a sample of state transitions, decisions, and costs obtained by simulation. Due to the large decision space, policy improvement is performed via the cross-entropy method. Computational experiments using realistic data illustrate the application of the algorithm. Heuristic policies obtained with polynomial and Fourier basis functions are compared with myopic and random policies. Results indicate that the learned policies can adequately control inventories, with an average cost up to 80% lower than that of a myopic policy.
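To make the policy-evaluation step concrete, the sketch below shows one standard way to solve the projected Bellman equation for a fixed policy by least squares (LSTD-style) from simulated samples. It is a minimal illustration, not the paper's exact procedure: the array names, the ridge regularizer, and the discount value are assumptions, and the feature map stands in for the polynomial or Fourier bases mentioned above.

```python
import numpy as np

def lstd_q(phi, phi_next, costs, gamma=0.95, ridge=1e-6):
    """Least-squares solution of the projected Bellman equation.

    phi      : (n, k) features of sampled state-decision pairs (s_t, a_t)
    phi_next : (n, k) features of successor pairs (s_{t+1}, pi(s_{t+1}))
    costs    : (n,)   observed one-step costs
    Returns weights w such that Q_pi(s, a) is approximated by phi(s, a) @ w.
    """
    # Projected Bellman equation in matrix form:
    #   Phi^T (Phi - gamma * Phi') w = Phi^T c
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ costs
    # Small ridge term (an assumption) guards against a singular A
    # when the sample is small relative to the number of features.
    return np.linalg.solve(A + ridge * np.eye(A.shape[1]), b)
```

With weights in hand, policy improvement would then search the decision space for low-valued actions; in the paper this search is carried out with the cross-entropy method rather than exact minimization.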