We propose an end-to-end learning framework based on hierarchical reinforcement learning, called H-TSP, for addressing the large-scale Travelling Salesman Problem (TSP). The proposed H-TSP constructs a solution of a TSP instance starting from the scratch relying on two components: the upper-level policy chooses a small subset of nodes (up to 200 in our experiment) from all nodes that are to be traversed, while the lower-level policy takes the chosen nodes as input and outputs a tour connecting them to the existing partial route (initially only containing the depot). After jointly training the upper-level and lower-level policies, our approach can directly generate solutions for the given TSP instances without relying on any time-consuming search procedures. To demonstrate effectiveness of the proposed approach, we have conducted extensive experiments on randomly generated TSP instances with different numbers of nodes. We show that H-TSP can achieve comparable results (gap 3.42% vs. 7.32%) as SOTA search-based approaches, and more importantly, we reduce the time consumption up to two orders of magnitude (3.32s vs. 395.85s). To the best of our knowledge, H-TSP is the first end-to-end deep reinforcement learning approach that can scale to TSP instances of up to 10000 nodes. Although there are still gaps to SOTA results with respect to solution quality, we believe that H-TSP will be useful for practical applications, particularly those that are time-sensitive e.g., on-call routing and ride hailing service.
翻译:我们提出了一种基于分层强化学习的端到端学习框架 H-TSP,用于解决大规模旅行商问题(TSP)。所提出的 H-TSP 构建了一个从头开始的 TSP 实例解决方案,依赖于两个组件:上层策略从要遍历的所有节点中选择一个小的节点子集(在我们的实验中最多达到200个),而下层策略将选择的节点作为输入,并输出连接它们到现有部分路径(最初仅包含集散地)的巡回路线。在联合训练上层和下层策略后,我们的方法可以直接为给定的 TSP 实例生成解决方案,而无需依赖于任何耗时的搜索过程。为了证明所提出的方法的有效性,我们对具有不同节点数的随机生成的 TSP 实例进行了大量实验。我们显示 H-TSP 可以达到与SOTA搜索方法可比较的结果(间隙为3.42% vs. 7.32%),更重要的是,我们将时间消耗减少了两个量级(3.32秒 vs. 395.85秒)。据我们所知,H-TSP 是第一个可以应用于多达10000个节点的 TSP 实例的端到端深度强化学习方法。尽管在解决方案质量方面仍存在差距,但我们认为 H-TSP 将非常有用于实际应用,特别是那些时间敏感的应用,如呼叫路由和打车服务。