Program synthesis from input-output (IO) examples has been a long-standing challenge. While recent work has demonstrated limited success on domain-specific languages (DSLs), applying these methods to real-world programming languages such as C remains highly challenging. The complicated syntax and token variation of such languages raise three major challenges: (1) unlike many DSLs, programs in languages like C must be compiled first and are not executed via interpreters; (2) the program search space grows exponentially as the syntax and semantics of the programming language become more complex; and (3) collecting a large-scale dataset of real-world programs is non-trivial. As a first step toward addressing these challenges, we propose LaSynth and show its efficacy in a restricted-C domain. More specifically, LaSynth learns a latent representation that approximates the execution of partially generated programs, even when they are syntactically incomplete (addressing (1)). The learned execution significantly improves next-token prediction over existing approaches, which facilitates search (addressing (2)). Finally, once trained on randomly generated ground-truth programs and their IO pairs, LaSynth can synthesize more concise programs that resemble human-written code. Furthermore, retraining our model on these synthesized programs yields better performance with fewer samples for both Karel and C program synthesis, indicating the promise of using the learned program synthesizer to improve dataset quality for input-output program synthesis (addressing (3)). When evaluated on whether the outputs of program execution match the IO pairs, LaSynth achieves 55.2% accuracy in generating simple C code of tens of tokens, including loops and branches, outperforming existing approaches without executors by around 20%.
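The evaluation criterion mentioned above, whether a synthesized program's execution outputs match the given IO pairs, can be illustrated with a minimal sketch. This is not LaSynth's evaluation code: the names `satisfies_io` and `io_accuracy` are hypothetical, candidate programs are modeled as plain Python callables rather than compiled C, and the IO format (lists of integers) is an assumption made only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed IO format for illustration: (input array, expected output array).
IOPair = Tuple[List[int], List[int]]
Program = Callable[[List[int]], List[int]]


def satisfies_io(program: Program, io_pairs: Sequence[IOPair]) -> bool:
    """Return True only if the program reproduces every expected output."""
    for inputs, expected in io_pairs:
        try:
            # Pass a copy so an in-place program cannot corrupt the specification.
            actual = program(list(inputs))
        except Exception:
            # A crashing candidate counts as a failed synthesis attempt.
            return False
        if actual != expected:
            return False
    return True


def io_accuracy(candidates: Sequence[Program],
                tasks: Sequence[Sequence[IOPair]]) -> float:
    """Fraction of tasks whose candidate matches all of that task's IO pairs."""
    if len(candidates) != len(tasks):
        raise ValueError("need exactly one candidate per task")
    if not tasks:
        return 0.0
    hits = sum(satisfies_io(p, t) for p, t in zip(candidates, tasks))
    return hits / len(tasks)


# Hypothetical stand-ins for synthesized programs.
def double_all(xs: List[int]) -> List[int]:
    return [2 * x for x in xs]


def add_one(xs: List[int]) -> List[int]:
    return [x + 1 for x in xs]


task = [([1, 2, 3], [2, 4, 6]), ([0], [0])]
print(satisfies_io(double_all, task))                 # True
print(satisfies_io(add_one, task))                    # False
print(io_accuracy([double_all, add_one], [task, task]))  # 0.5
```

Under this criterion a candidate counts as correct when it is consistent with all IO pairs of its task, regardless of whether it is token-for-token identical to a ground-truth program. For real C programs, the call to `program` would have to be replaced by compiling and running the candidate, which this sketch deliberately leaves out.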