We present an end-to-end framework for solving the Vehicle Routing Problem (VRP) using reinforcement learning. In this approach, we train a single model that finds near-optimal solutions for problem instances sampled from a given distribution, only by observing the reward signals and following feasibility rules. Our model represents a parameterized stochastic policy, and by applying a policy gradient algorithm to optimize its parameters, the trained model produces the solution as a sequence of consecutive actions in real time, without the need to re-train for every new problem instance. On capacitated VRP, our approach outperforms classical heuristics and Google's OR-Tools on medium-sized instances in solution quality with comparable computation time (after training). We demonstrate how our approach can handle problems with split delivery and explore the effect of such deliveries on the solution quality. Our proposed framework can be applied to other variants of the VRP such as the stochastic VRP, and has the potential to be applied more generally to combinatorial optimization problems.
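To make the abstract's description more concrete, here is a minimal sketch of the general idea: a parameterized stochastic policy builds a capacitated-VRP solution one node at a time, infeasible choices are masked out by the feasibility rules, and the parameters are updated with a REINFORCE-style policy gradient on the negative route length. This is an illustration under stated assumptions, not the paper's released model: the network architecture, the moving-average baseline, the instance generator, and all hyperparameters below are simplified placeholders, and PyTorch is assumed only for convenience.

```python
import torch
import torch.nn as nn

N_CUSTOMERS, CAPACITY, MAX_DEMAND = 10, 30.0, 9.0  # toy instance settings (assumed)

class PointerPolicy(nn.Module):
    """Scores each node from its (x, y, remaining demand) features plus the
    vehicle's normalized remaining load; infeasible nodes are masked before sampling."""
    def __init__(self, hidden=128):
        super().__init__()
        self.node_enc = nn.Sequential(nn.Linear(3, hidden), nn.ReLU())
        self.score = nn.Linear(hidden + 1, 1)

    def forward(self, nodes, load, mask):
        h = self.node_enc(nodes)                               # (n+1, hidden)
        load_feat = load.expand(h.size(0), 1)                  # broadcast vehicle load
        logits = self.score(torch.cat([h, load_feat], dim=1)).squeeze(-1)
        logits = logits.masked_fill(mask, float('-inf'))       # feasibility mask
        return torch.distributions.Categorical(logits=logits)

def rollout(policy, coords, demands):
    """Decode one solution as a sequence of node visits (node 0 is the depot)."""
    n = coords.size(0)
    remaining = demands.clone()
    load = torch.tensor([CAPACITY])
    pos, route_len, log_prob = 0, 0.0, 0.0
    while remaining[1:].sum() > 1e-6:
        # Feasibility rules: skip served customers and those exceeding the load;
        # allow the depot only when no customer can be served with the current load.
        mask = torch.zeros(n, dtype=torch.bool)
        mask[1:] = (remaining[1:] <= 1e-6) | (remaining[1:] > load + 1e-6)
        mask[0] = not mask[1:].all()
        nodes = torch.cat([coords, remaining.unsqueeze(1)], dim=1)
        dist = policy(nodes, load / CAPACITY, mask)
        nxt = dist.sample()
        log_prob = log_prob + dist.log_prob(nxt)
        route_len = route_len + (coords[nxt] - coords[pos]).norm()
        if nxt.item() == 0:
            load = torch.tensor([CAPACITY])                    # refill at the depot
        else:
            load = load - remaining[nxt]
            remaining[nxt] = 0.0
        pos = nxt.item()
    route_len = route_len + (coords[0] - coords[pos]).norm()   # return to the depot
    return route_len, log_prob

policy = PointerPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
baseline = None
for step in range(200):                                        # toy training loop
    # Sample a fresh instance from the training distribution (uniform here, an assumption).
    coords = torch.rand(N_CUSTOMERS + 1, 2)
    demands = torch.cat([torch.zeros(1), torch.rand(N_CUSTOMERS) * MAX_DEMAND])
    cost, log_prob = rollout(policy, coords, demands)
    baseline = cost.detach() if baseline is None else 0.9 * baseline + 0.1 * cost.detach()
    loss = (cost.detach() - baseline) * log_prob               # REINFORCE with a moving-average baseline
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At test time, the same trained policy is simply decoded (greedily or by sampling) on a new instance drawn from the same distribution, with no further parameter updates, which is what the abstract means by producing solutions in real time without re-training.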