We propose a novel approach to interactive theorem-proving (ITP) using deep reinforcement learning. Unlike previous work, our framework is able to prove theorems both end-to-end and from scratch (i.e., without relying on example proofs from human experts). We formulate the ITP process as a Markov decision process (MDP) in which each state represents a set of potential derivation paths. The agent learns to select promising derivations, as well as appropriate tactics within each derivation, using deep policy gradients. This structure allows us to introduce a novel backtracking mechanism that enables the agent to efficiently discard (predicted) dead-end derivations and to restart the derivation from promising alternatives. Experimental results show that the framework performs comparably to approaches that rely on human experts, and that it can also prove theorems it has never seen during training. We further examine the role of each component of the framework through ablation studies.
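The sketch below is a minimal illustration, not the paper's implementation, of the MDP view described above: a state is a set of candidate derivation paths, the agent selects a derivation and a tactic within it, and the backtracking mechanism discards a derivation predicted to be a dead end so the search can continue from the remaining alternatives. All names (`Derivation`, `State`, `RandomPolicy`, the tactic strings) are hypothetical placeholders, and the random policy stands in for the learned policy-gradient networks.

```python
# Hypothetical sketch of the state-as-set-of-derivations MDP with backtracking.
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class Derivation:
    """One partial proof attempt: the sequence of tactics applied so far."""
    tactics: List[str] = field(default_factory=list)
    dead_end: bool = False  # set when the agent predicts no proof can follow

@dataclass
class State:
    """An MDP state: the set of derivation paths explored so far."""
    derivations: List[Derivation]

class RandomPolicy:
    """Stand-in for the learned policy-gradient networks (assumption)."""
    def select_derivation(self, ds): return random.choice(ds)
    def select_tactic(self, d): return random.choice(["intro", "apply", "rewrite"])
    def predict_dead_end(self, d): return random.random() < 0.1

def step(state: State, policy: RandomPolicy) -> State:
    """One environment step: pick a derivation, backtrack or extend it."""
    live = [d for d in state.derivations if not d.dead_end]
    if not live:                                 # all paths discarded: restart
        state.derivations.append(Derivation())
        return state
    d = policy.select_derivation(live)           # pick a promising derivation
    if policy.predict_dead_end(d):               # backtracking mechanism
        d.dead_end = True                        # discard it; alternatives remain
        return state
    d.tactics.append(policy.select_tactic(d))    # extend it with a tactic
    return state

if __name__ == "__main__":
    s = State(derivations=[Derivation()])
    for _ in range(20):
        s = step(s, RandomPolicy())
    print([len(d.tactics) for d in s.derivations])
```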