POMO: Policy Optimization with Multiple Optima for Reinforcement Learning

In neural combinatorial optimization (CO), reinforcement learning (RL) can turn a deep neural net into a fast, powerful heuristic solver of NP-hard problems. This approach has great potential in practical applications because it allows near-optimal solutions to be found without guidance from experts with substantial domain knowledge. We introduce Policy Optimization with Multiple Optima (POMO), an end-to-end approach for building such a heuristic solver. POMO is applicable to a wide range of CO problems. It is designed to exploit the symmetries in the representation of a CO solution. POMO uses a modified REINFORCE algorithm that forces diverse rollouts towards all optimal solutions. Empirically, the low-variance baseline of POMO makes RL training fast and stable, and it is more resistant to local minima than previous approaches. We also introduce a new augmentation-based inference method that complements POMO well. We demonstrate the effectiveness of POMO by solving three popular NP-hard problems: traveling salesman (TSP), capacitated vehicle routing (CVRP), and 0-1 knapsack (KP). For all three, our POMO-based solver shows a significant improvement in performance over all recent learned heuristics. In particular, we achieve an optimality gap of 0.14% on TSP100 while reducing inference time by more than an order of magnitude.
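To make the ideas of solution symmetry, a shared low-variance baseline, and augmentation-based inference more concrete, the following is a minimal Python sketch and not the authors' implementation. It rests on assumptions the abstract does not spell out: that the baseline is the mean reward over all rollouts of the same instance, that the reward is the negative tour length, and that augmentation means flipping and swapping coordinates of a TSP instance in the unit square. All function names (`tour_length`, `rotations`, `shared_baseline_advantages`, `augment_unit_square`) are hypothetical, and the rollout rewards in the demo are made-up numbers standing in for a policy's output.

```python
import math


def tour_length(coords, tour):
    """Total Euclidean length of the closed tour visiting coords in the given order."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n))


def rotations(tour):
    """All cyclic rotations of a tour: the same solution written from each start node."""
    return [tour[i:] + tour[:i] for i in range(len(tour))]


def shared_baseline_advantages(rewards):
    """Advantage of each rollout relative to a shared baseline.

    Assumption: the baseline is the mean reward of all rollouts on one instance.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def augment_unit_square(coords):
    """Eight length-preserving copies of an instance in the unit square.

    Assumption: augmentation is an optional axis swap combined with optional
    flips x -> 1 - x and y -> 1 - y.
    """
    variants = []
    for swap in (False, True):
        for flip_x in (False, True):
            for flip_y in (False, True):
                points = []
                for x, y in coords:
                    if swap:
                        x, y = y, x
                    if flip_x:
                        x = 1.0 - x
                    if flip_y:
                        y = 1.0 - y
                    points.append((x, y))
                variants.append(points)
    return variants


if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    tour = [0, 1, 2, 3]

    # Every rotation describes the same tour, so all lengths agree.
    print([tour_length(square, t) for t in rotations(tour)])

    # Hypothetical rewards (negative tour lengths) of four rollouts on one instance.
    print(shared_baseline_advantages([-4.0, -4.0, -5.0, -3.0]))

    # Each augmented copy is a different input with the same optimal tour length.
    print([tour_length(v, tour) for v in augment_unit_square(square)])
```

In a full training loop, each advantage would weight the log-probability gradient of its rollout, so rollouts better than the instance average are reinforced and worse ones are discouraged. At inference time, one would solve every augmented copy and keep the best tour found.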
