Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates that are over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate as directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.