We introduce a new type of programming challenge called programming puzzles, as an objective and comprehensive evaluation of program synthesis, and release an open-source dataset of Python Programming Puzzles (P3). Each puzzle is defined by a short Python program $f$, and the goal is to find an input which makes $f$ return True. The puzzles are objective in that each one is specified entirely by the source code of its verifier $f$, so evaluating $f$ is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding. The dataset is comprehensive in that it spans a range of difficulties and domains, from trivial string manipulation problems, to classic programming puzzles (e.g., Tower of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). We develop baseline enumerative program synthesis, GPT-3, and Codex solvers that are capable of solving puzzles, even without access to any reference solutions, by learning from their own past solutions. Codex performs best, solving up to 18% of 397 test problems with a single try and 80% of the problems with 1,000 tries per problem. In a small user study, we find a positive correlation between puzzle-solving performance and coding experience, and between the puzzle difficulty for humans and AI solvers. Therefore, further improvements on P3 could have a significant impact on many program synthesis areas.
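To make the puzzle format concrete, the following is a minimal sketch of two puzzles in the style described above, together with a naive brute-force search. The puzzles `f` and `g`, the helper names `solve` and `short_strings`, and the candidate spaces are all illustrative assumptions; they are not taken from the P3 dataset, and the search is not one of the solvers evaluated here. The sketch only shows that a verifier alone is enough to check a candidate, with no answer key involved.

```python
from itertools import product
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def f(s: str) -> bool:
    """Hypothetical string puzzle: find s that completes the greeting."""
    return "Hello " + s == "Hello world"


def g(x: int) -> bool:
    """Hypothetical factoring-style puzzle: find a nontrivial divisor of 391."""
    return 1 < x < 391 and 391 % x == 0


def solve(puzzle: Callable[[T], bool], candidates: Iterable[T]) -> Optional[T]:
    """Return the first candidate the verifier accepts, or None if none does."""
    for candidate in candidates:
        # The verifier is the entire specification: accept iff it returns True.
        if puzzle(candidate) is True:
            return candidate
    return None


def short_strings(alphabet: str, max_len: int) -> Iterator[str]:
    """Yield every string over `alphabet` of length 0..max_len, shortest first."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)


if __name__ == "__main__":
    print(solve(f, short_strings("dlorw", 5)))
    print(solve(g, range(2, 391)))
```

Checking a proposed answer is just a call such as `f("world")`, whereas finding one requires search. Exhaustive enumeration like this is only feasible for tiny candidate spaces, which is why stronger solvers are needed for the harder puzzles.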