Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.
翻译:正式核查软件属性是一项非常可取但劳动密集型的任务。最近的工作已经开发出一些方法,用证据助理(如Coq和Isabelle/HOL)进行正式核查自动化,例如,培训一个模型来一次预测一个证据步骤,并使用该模型在可能的证明空间中搜索可能的证明。本文引入了一个新的方法,使正式核查自动化:我们使用大型语言模型,在自然语言文本和代码方面受过培训,并精细调整了校准,以便一次性地而不是一次地一次地为理论库生成整个证据。我们把这一证据生成模型与一个经过精细调整的修理模型结合起来,以修复生成的证据,进一步增强验证能力。作为其主要贡献,本文首次表明:(1) 使用变压器进行全体核查是可能的,而且与基于搜索的技术一样有效,而无需花费昂贵的搜索。(2) 向所学的模型附加了额外的背景,例如以前的证明尝试失败和随后的错误信息,导致证据的修复和进一步的自动化证据生成。(3) 我们为完全的自动证据合成而设计新状态,我们用它来进行新的证据合成。 我们用原型模型、BALL36 展示了整个实验性检验和实验性检验方法,我们用整个模型来展示了整个的模型, 展示了整个模型, 展示了整个的模型,我们用来展示了整个工具的模型, 展示了整个实验性地展示了它。</s>