There is growing interest in integrating machine learning techniques and optimization to solve challenging optimization problems. In this work, we propose a deep reinforcement learning methodology for the job shop scheduling problem (JSSP). The aim is to build a greedy-like heuristic able to learn over a distribution of JSSP instances that differ in the number of jobs and machines. The need for fast scheduling methods is well known, and it arises in many areas, from transportation to healthcare. We model the JSSP as a Markov Decision Process and then exploit the efficacy of reinforcement learning to solve the problem. We adopt an actor-critic scheme, in which the action taken by the agent is influenced by policy considerations on the state-value function. The procedures are adapted to the challenging nature of the JSSP, where the state and the action space change not only across instances but also after each decision. To handle the variability in the number of jobs and operations in the input, we model the agent using two incident LSTM models, a special type of deep neural network. Experiments show that the algorithm reaches good solutions in a short time, proving that it is possible to derive new greedy heuristics purely from learning-based methodologies. Benchmarks were generated in comparison with the commercial solver CPLEX. As expected, the model can generalize, to some extent, to larger problems and to instances drawn from a distribution different from the one used in training.
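To make the MDP formulation concrete, the sketch below simulates one greedy dispatching episode for a toy JSSP instance: the state consists of machine and job availability times, an action selects which job's next operation to schedule, and the transition updates the schedule. This is a minimal illustration only; the function names, the feature-free stochastic policy interface, and the instance encoding are assumptions for exposition, and it does not replicate the paper's two-LSTM actor-critic agent.

```python
import numpy as np

def simulate(proc, mach, policy_logits_fn, rng):
    """Roll out one greedy dispatching episode for a JSSP instance.

    proc[j, o] : processing time of operation o of job j
    mach[j, o] : machine required by that operation
    At each step the action is choosing which job's next operation to
    schedule, mirroring the MDP view where state and action space shrink
    after every decision. Returns the makespan and the schedule.
    """
    n_jobs, n_ops = proc.shape
    next_op = np.zeros(n_jobs, dtype=int)       # next unscheduled op per job
    job_ready = np.zeros(n_jobs)                # time each job becomes free
    mach_ready = np.zeros(mach.max() + 1)       # time each machine becomes free
    trajectory = []
    while (next_op < n_ops).any():
        # Action space: jobs that still have operations left.
        avail = np.flatnonzero(next_op < n_ops)
        logits = policy_logits_fn(avail, next_op, job_ready, mach_ready)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        j = avail[rng.choice(len(avail), p=probs)]
        o = next_op[j]
        m = mach[j, o]
        start = max(job_ready[j], mach_ready[m])
        finish = start + proc[j, o]
        job_ready[j] = mach_ready[m] = finish   # state transition
        next_op[j] += 1
        trajectory.append((j, o, start))
    return job_ready.max(), trajectory

# Usage on a hypothetical 2-job, 2-machine instance with a uniform policy
# (a learned policy would replace the zero logits with network outputs):
proc = np.array([[3.0, 2.0], [2.0, 4.0]])
mach = np.array([[0, 1], [1, 0]])
uniform_policy = lambda avail, *state: np.zeros(len(avail))
makespan, schedule = simulate(proc, mach, uniform_policy, np.random.default_rng(0))
```

In an actor-critic setting, `policy_logits_fn` would be the actor network scoring each available job, while a critic would estimate the state value to reduce the variance of the policy-gradient update.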