Deriving a good variable selection strategy for branch-and-bound is essential to the efficiency of modern mixed-integer programming (MIP) solvers. With branching data collected during prior solution processes, learning-to-branch methods have recently been shown to outperform hand-crafted heuristics. Since branch-and-bound is naturally a sequential decision-making task, one should learn to optimize the utility of the whole MIP solving process rather than being myopic at each step. In this work, we formulate learning to branch as an offline reinforcement learning (RL) problem and propose a long-sighted hybrid search scheme for constructing the offline MIP dataset, which values the long-term utility of branching decisions. During the policy training phase, we deploy a ranking-based reward assignment scheme to distinguish promising samples from both the long-term and short-term views, and train the branching model, named Branch Ranking, via offline policy learning. Experiments on synthetic MIP benchmarks and real-world tasks demonstrate that Branch Ranking is more efficient and robust, and generalizes better to MIP instances of larger scale than widely used heuristics and state-of-the-art learning-based branching models.
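As a rough illustration of the ranking-based reward idea mentioned above (not the paper's actual implementation), the sketch below assigns rewards to collected branching samples according to the rank of a hypothetical long-term utility measure; the function and field names (`ranking_based_rewards`, `tree_size_reduction`) are assumptions introduced here for clarity.

```python
import numpy as np

def ranking_based_rewards(samples, utility_key="tree_size_reduction"):
    """Assign rewards to branching samples by ranking a long-term utility measure.

    Each sample is a dict containing `utility_key` (higher is better).
    Returns rewards scaled to [0, 1], where the best-ranked sample gets 1.0.
    """
    utilities = np.array([s[utility_key] for s in samples], dtype=float)
    if len(samples) == 1:
        return [1.0]
    # argsort twice yields the rank of each element (0 = worst)
    ranks = utilities.argsort().argsort()
    return list(ranks / (len(samples) - 1))

# Toy usage: three candidate branching decisions recorded in the offline dataset.
samples = [
    {"var": "x3", "tree_size_reduction": 120.0},
    {"var": "x7", "tree_size_reduction": 45.0},
    {"var": "x1", "tree_size_reduction": 88.0},
]
print(ranking_based_rewards(samples))  # -> [1.0, 0.0, 0.5]
```

In an offline policy-learning setup, such rank-derived rewards would supervise the branching model so that samples judged promising over the long run receive higher credit than those that only look good locally.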