This work shows how large-scale language models (LMs) can be used to synthesize programming problems with verified solutions, in the form of programming puzzles, which can in turn be used to fine-tune those same models, improving their performance. This work builds on two recent developments. First, LMs have achieved breakthroughs in non-trivial reasoning and algorithm implementation, generating code that can solve some intermediate-level competitive programming problems. However, training code LMs requires curated sets of natural-language problem descriptions paired with source-code tests and solutions, and such datasets are limited in size. Second, a new format of programming challenge called a programming puzzle was introduced, which requires no natural-language description and is specified directly by a source-code test. In this work we show how generating synthetic programming puzzles and solutions, verified for correctness by a Python interpreter, can improve performance in solving held-out test puzzles from P3, a public benchmark set of Python Programming Puzzles. Additionally, we release a dataset of 1 million puzzles and solutions generated by the Codex model, which we show can improve smaller models through fine-tuning.
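To make the puzzle format concrete, the following is a minimal sketch in the style of P3; the specific puzzle and the helper name sol are illustrative assumptions, not items drawn from the released dataset. A puzzle is a Python function sat that returns True exactly when its argument is a valid solution, so correctness can be checked by simply executing the code.

def sat(s: str) -> bool:
    # Puzzle: find a string containing exactly 1000 'o' characters,
    # with no two 'o' characters adjacent.
    return s.count("o") == 1000 and s.count("oo") == 0

def sol() -> str:
    # One possible solution: alternate 'o' with another character.
    return "ox" * 1000

# Verification needs no natural-language understanding;
# the Python interpreter alone decides correctness.
assert sat(sol())

Because the interpreter serves as the verifier, synthetic puzzle/solution pairs generated by an LM can be filtered automatically: only pairs for which sat(sol()) evaluates to True need be kept for fine-tuning.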