Research on language modeling of source code has advanced considerably in recent years, with applications ranging from code suggestion and completion to code summarization. However, complete program synthesis for industry-grade programming languages has not been researched extensively. In this work, we introduce a variational autoencoder model for program synthesis of industry-grade programming languages. Our model incorporates the internal hierarchical structure of source code and operates on parse trees. By learning a latent representation of source code over trees, we capture more information and achieve higher performance than standard autoregressive autoencoder models. Furthermore, because our model is tree-structured, the autoregressive operations are performed on paths of trees instead of linear sequences. Therefore, the length of the sequences that the autoregressive model processes scales with the width and depth of the tree rather than with its total size, which mitigates the common problem of exploding and vanishing gradients.
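To make the scaling argument in the abstract concrete, the following is a minimal sketch that compares the total number of nodes in a parse tree with its depth and its maximum branching width. It is an illustration only, not the model described above: it assumes Python's standard `ast` module as a stand-in parser, and `tree_stats` and `SOURCE` are hypothetical names introduced here.

```python
import ast


def tree_stats(node):
    """Return (total_nodes, depth, max_width) for the subtree rooted at node.

    depth     = number of nodes on the longest root-to-leaf path
    max_width = largest number of direct children of any single node
    """
    children = list(ast.iter_child_nodes(node))
    total, depth, width = 1, 1, len(children)
    for child in children:
        child_total, child_depth, child_width = tree_stats(child)
        total += child_total
        depth = max(depth, child_depth + 1)
        width = max(width, child_width)
    return total, depth, width


# Hypothetical sample program, used only to have something to parse.
SOURCE = """
def add(a, b):
    return a + b

def scale(values, factor):
    result = []
    for v in values:
        result.append(v * factor)
    return result

print(scale([add(1, 2), add(3, 4)], 10))
"""

total, depth, width = tree_stats(ast.parse(SOURCE))
print(f"total nodes: {total}")
print(f"depth (longest root-to-leaf path): {depth}")
print(f"max width (most children of one node): {width}")
```

A sequence model over the flattened program would have to process a sequence on the order of the total node count, whereas a model that runs along root-to-leaf paths and across sibling lists sees sequences bounded by the depth and width. The exact counts depend on the parser and language version, so none are quoted here.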