We consider the problem of learning generalized policies for classical planning domains, using graph neural networks (GNNs) trained on small instances represented in lifted STRIPS. The problem has been considered before, but the proposed neural architectures are complex and the results are often mixed. In this work, we use a simple and general GNN architecture and aim at obtaining crisp experimental results and a deeper understanding: either the policy that is greedy in the learned value function achieves close to 100% generalization over instances larger than those used in training, or the failure must be understood, and possibly fixed, logically. For this, we exploit the relation established between the expressive power of GNNs and the $C_{2}$ fragment of first-order logic (namely, FOL with two variables and counting quantifiers). We find, for example, that domains whose general policies require more expressive features can be solved with GNNs once the states are extended with suitable "derived atoms" that encode role compositions and transitive closures, which do not fit into $C_{2}$. The work follows the GNN approach for learning optimal general policies in a supervised fashion (Stahlberg, Bonet, and Geffner 2022), but the learned policies are no longer required to be optimal (which expands the scope, as many planning domains do not have general optimal policies) and are learned without supervision. Interestingly, value-based reinforcement learning methods that aim to produce optimal policies do not always yield policies that generalize, as the goals of optimality and generality are in conflict in domains where optimal planning is NP-hard.
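To make the greedy-in-$V$ policy concrete, the following is a minimal Python sketch of the action-selection rule described above; `value_fn` and `successors` are hypothetical stand-ins for the learned GNN value network and the planner's successor generator, not code from this work.

```python
# Sketch (not the paper's implementation) of a policy that acts greedily
# in a learned value function V: from the current state, move to the
# successor state with the lowest predicted value.

from typing import Callable, Hashable, Iterable

State = Hashable

def greedy_policy(
    state: State,
    successors: Callable[[State], Iterable[State]],
    value_fn: Callable[[State], float],
) -> State:
    """Return the successor state that minimizes the learned value V(s')."""
    return min(successors(state), key=value_fn)

# Toy illustration: states are integers, the goal is 0, and the
# (hand-coded) value function is just the distance to the goal.
if __name__ == "__main__":
    succ = lambda s: [s - 1, s + 1]   # toy successor generator
    V = lambda s: abs(s)              # stand-in for the learned value network
    s = 5
    while V(s) > 0:                   # follow the greedy policy to the goal
        s = greedy_policy(s, succ, V)
    print("reached goal state:", s)
```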
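Likewise, the "derived atoms" mentioned above can be precomputed and added to the states before they are fed to the GNN. Below is a minimal sketch, with names of our own choosing, of augmenting a Blocksworld-like state with atoms for the transitive closure of an `on` relation (an `above` relation), a feature that lies outside $C_{2}$.

```python
# Sketch (hypothetical, not the paper's code) of extending a state with
# derived atoms for the transitive closure of a binary relation.

from itertools import product

def transitive_closure(pairs: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Compute the transitive closure of a binary relation."""
    closure = set(pairs)
    objects = {x for pair in pairs for x in pair}
    # Floyd-Warshall-style pass: k ranges over intermediate objects
    # (product varies the first coordinate slowest, so k is outermost).
    for k, i, j in product(objects, repeat=3):
        if (i, k) in closure and (k, j) in closure:
            closure.add((i, j))
    return closure

# Blocksworld-style usage: on(a,b), on(b,c), on(c,d) yields the extra
# above atoms (a,c), (a,d), (b,d) in addition to the on pairs themselves.
on = {("a", "b"), ("b", "c"), ("c", "d")}
above = transitive_closure(on)
state = {("on", x, y) for x, y in on} | {("above", x, y) for x, y in above}
print(sorted(state))
```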