Recently, deep reinforcement learning (DRL) frameworks have shown potential for solving NP-hard routing problems such as the traveling salesman problem (TSP) without problem-specific expert knowledge. Although DRL can be used to solve complex problems, DRL frameworks still struggle to compete with state-of-the-art heuristics, leaving a substantial performance gap. This paper proposes a novel hierarchical problem-solving strategy, termed learning collaborative policies (LCP), which can effectively find near-optimum solutions using two iterative DRL policies: the seeder and the reviser. The seeder generates candidate solutions (seeds) that are as diversified as possible while being dedicated to exploring the full combinatorial action space (i.e., the sequence of assignment actions). To this end, we train the seeder's policy using a simple yet effective entropy regularization reward to encourage the seeder to find diverse solutions. On the other hand, the reviser modifies each candidate solution generated by the seeder; it partitions the full trajectory into sub-tours and simultaneously revises each sub-tour to minimize its traveling distance. Thus, the reviser is trained to improve each candidate solution's quality, focusing on the reduced solution space (which is beneficial for exploitation). Extensive experiments demonstrate that the proposed two-policy collaboration scheme improves over the single-policy DRL framework on various NP-hard routing problems, including TSP, the prize collecting TSP (PCTSP), and the capacitated vehicle routing problem (CVRP).
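To make the seeder-reviser scheme concrete, the following is a minimal Python sketch of the collaboration loop for TSP. The learned policies of the paper are attention-based neural networks; here they are replaced by simple stand-ins (a random-permutation sampler for the seeder and a single 2-opt sweep per sub-tour for the reviser) purely to illustrate the control flow. All function names and parameters are hypothetical and do not come from the paper's codebase.

```python
# Minimal sketch of the seeder-reviser collaboration: the seeder explores the
# full tour space by producing diverse candidates, and the reviser exploits by
# partitioning each candidate into sub-tours and locally improving each one.
import random
import math


def tour_length(coords, tour):
    """Total Euclidean length of a closed tour over city indices."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def sample_seeds(coords, num_seeds):
    """Stand-in for the seeder: emit diverse candidate tours.
    The learned seeder samples from an entropy-regularized policy instead."""
    n = len(coords)
    seeds = []
    for _ in range(num_seeds):
        tour = list(range(n))
        random.shuffle(tour)
        seeds.append(tour)
    return seeds


def revise_subtour(coords, sub):
    """Stand-in for the reviser: one 2-opt sweep over a sub-tour,
    keeping its two endpoints fixed so the rest of the tour stays valid."""
    improved = sub[:]
    for i in range(1, len(improved) - 2):
        for j in range(i + 1, len(improved) - 1):
            a, b = improved[i - 1], improved[i]
            c, d = improved[j], improved[j + 1]
            if (math.dist(coords[a], coords[c]) + math.dist(coords[b], coords[d])
                    < math.dist(coords[a], coords[b]) + math.dist(coords[c], coords[d])):
                improved[i:j + 1] = reversed(improved[i:j + 1])
    return improved


def revise(coords, tour, segment_len):
    """Partition the full trajectory into sub-tours and revise each one."""
    revised = []
    for start in range(0, len(tour), segment_len):
        revised.extend(revise_subtour(coords, tour[start:start + segment_len]))
    return revised


if __name__ == "__main__":
    random.seed(0)
    cities = [(random.random(), random.random()) for _ in range(40)]
    candidates = sample_seeds(cities, num_seeds=16)                        # exploration
    candidates = [revise(cities, t, segment_len=10) for t in candidates]   # exploitation
    best = min(candidates, key=lambda t: tour_length(cities, t))
    print("best tour length:", round(tour_length(cities, best), 3))
```

The essential structure mirrors the abstract: the seeder handles broad exploration over full tours, while the reviser works on short segments of each candidate, where local improvement is far more tractable than searching the full combinatorial space.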