State-of-the-art Mixed Integer Linear Program (MILP) solvers combine systematic tree search with a plethora of hard-coded heuristics, such as the branching rule. The idea of learning branching rules from data has received increasing attention recently, and promising results have been obtained by learning fast approximations of the strong branching expert. In this work, we instead propose to learn branching rules from scratch via Reinforcement Learning (RL). We revisit the work of Etheve et al. (2020) and propose tree Markov Decision Processes, or tree MDPs, a generalization of temporal MDPs that provides a more suitable framework for learning to branch. We derive a tree policy gradient theorem, which exhibits better credit assignment than its temporal counterpart. We demonstrate through computational experiments that tree MDPs improve learning convergence, and offer a promising framework for tackling the learning-to-branch problem in MILPs.
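As a rough illustration of the credit assignment difference (the notation and exact form below are our own sketch, not a statement from the paper): a temporal policy gradient credits each branching decision with the return of the entire remaining trajectory, whereas a tree policy gradient can instead credit the decision taken at a search-tree node with the return of the subtree rooted at that node, e.g.

$$
\nabla_\theta J(\pi_\theta) \;\propto\; \mathbb{E}\!\left[\sum_{i \in \mathcal{T}} \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, G_i\right],
$$

where $\mathcal{T}$ denotes the set of decision nodes in the search tree and $G_i$ the return accumulated within the subtree rooted at node $i$, in contrast to the full-episode return used by the standard (temporal) policy gradient theorem.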