In this paper, we introduce a two-player zero-sum framework between a trainable \emph{Solver} and a \emph{Data Generator} to improve the generalization ability of deep learning-based solvers for the Traveling Salesman Problem (TSP). Grounded in \textsl{Policy Space Response Oracle} (PSRO) methods, our two-player framework outputs a population of best-responding Solvers, over which we mix to produce a combined model that attains the lowest exploitability against the Generator and thereby the most generalizable performance across different TSP tasks. We conduct experiments on a variety of TSP instances of different types and sizes. The results suggest that our Solvers achieve state-of-the-art performance even on tasks the Solver has never encountered, whereas the performance of other deep learning-based Solvers drops sharply due to over-fitting. To demonstrate the principle of our framework, we study the learning outcome of the proposed two-player game and show that the exploitability of the Solver population decreases during training, so that the population, together with the Generator, eventually approximates a Nash equilibrium.
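For concreteness, here is a minimal sketch of the exploitability quantity referred to above, under the assumption that the Solver's cost on Generator-produced instances defines the zero-sum payoff; the symbols $c$, $\sigma$, $\mathcal{G}$, and $v^{*}$ are illustrative notation, not taken from the paper:

\begin{verbatim}
% Sketch (assumed notation): c(s, g) is Solver s's cost
% (e.g., optimality gap) on TSP instances produced by
% Generator g, so the Generator's payoff is +c and the
% Solver's is -c. For a mixture \sigma over the Solver
% population, exploitability measures the distance of
% \sigma from the minimax-optimal mixture:
\[
  \mathrm{Expl}(\sigma)
  \;=\;
  \max_{g \in \mathcal{G}}
     \mathbb{E}_{s \sim \sigma}\!\left[ c(s, g) \right]
  \;-\; v^{*},
  \qquad
  v^{*} \;=\; \min_{\sigma'} \max_{g \in \mathcal{G}}
     \mathbb{E}_{s \sim \sigma'}\!\left[ c(s, g) \right].
\]
% Expl(\sigma) >= 0, and Expl(\sigma) = 0 exactly when
% \sigma attains the game value, i.e., the Solver mixture
% and the Generator form a Nash equilibrium.
\end{verbatim}

Under this reading, the claim that exploitability decreases during training corresponds to the Solver mixture approaching the minimax value of the game.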