Learning Vehicle Routing Problems using Policy Optimisation

Deep reinforcement learning (DRL) has been used to learn effective heuristics for solving complex combinatorial optimisation problems via policy networks and has demonstrated promising performance. Existing works have focused on (vehicle) routing problems because they strike a good balance between non-triviality and difficulty. State-of-the-art approaches learn a policy using reinforcement learning, and the learnt policy acts as a pseudo-solver. These approaches perform well in some cases, but given the large search space of a typical combinatorial/routing problem, they can converge too quickly to a poor policy. To prevent this, we propose an approach named entropy regularised reinforcement learning (ERRL), which supports exploration by providing more stochastic policies and thereby tends to improve optimisation. Empirically, the low variance of ERRL makes RL training fast and stable. We also introduce a combination of local search operators at test time, which significantly improves solution quality and complements ERRL. We qualitatively demonstrate that, for vehicle routing problems, a policy with higher entropy can smooth the optimisation landscape, which makes it easier to optimise. The quantitative evaluation shows that the performance of the model is comparable with state-of-the-art variants. In our evaluation, we experimentally illustrate that the model produces state-of-the-art performance on variants of vehicle routing problems such as the Capacitated Vehicle Routing Problem (CVRP), Multiple Routing with Fixed Fleet Problems (MRPFF) and the Travelling Salesman Problem.
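To make the idea of entropy regularisation concrete, the following is a minimal sketch of a REINFORCE-style surrogate loss with an entropy bonus for one sampled tour. It is an illustration under stated assumptions, not the formulation used in the paper: the function names, the scalar `baseline`, and the entropy weight `beta` are all hypothetical, and the abstract does not specify the exact objective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularised_loss(step_logits, actions, tour_cost, baseline, beta=0.01):
    """Surrogate loss to minimise for one sampled tour (hypothetical sketch).

    step_logits: one list of logits per decoding step (scores over candidate nodes)
    actions:     index of the node chosen at each step
    tour_cost:   total length of the sampled tour (lower is better)
    baseline:    scalar baseline used to reduce variance (assumed, not from the paper)
    beta:        weight of the entropy bonus (assumed value)
    """
    log_prob = 0.0
    total_entropy = 0.0
    for logits, action in zip(step_logits, actions):
        probs = softmax(logits)
        log_prob += math.log(probs[action])
        total_entropy += entropy(probs)
    advantage = tour_cost - baseline
    # Minimising the first term lowers the probability of costly tours;
    # subtracting the entropy term rewards more stochastic policies.
    return advantage * log_prob - beta * total_entropy
```

With `beta = 0` this reduces to a plain policy-gradient surrogate. A larger `beta` penalises a policy that becomes near-deterministic early on, which is the premature convergence the abstract describes.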
