An effective and efficient encoding of the source code of a computer program is critical to the success of sequence-to-sequence deep neural network models for tasks in computer program comprehension, such as automated code summarization and documentation. A significant challenge is to find a sequential representation that captures the structural/syntactic information in a computer program and facilitates the training of the learning models. In this paper, we propose to use the Pr\"ufer sequence of the Abstract Syntax Tree (AST) of a computer program to design a sequential representation scheme that preserves the structural information in an AST. Our representation makes it possible to develop deep-learning models in which signals carried by lexical tokens in the training examples can be exploited automatically and selectively, based on their syntactic role and importance. Unlike other recently proposed approaches, our representation is concise and lossless in terms of the structural information of the AST. Empirical studies on real-world benchmark datasets, using a sequence-to-sequence learning model we designed for code summarization, show that our Pr\"ufer-sequence-based representation is indeed highly effective and efficient, significantly outperforming all of the recently proposed deep-learning models we used as baselines.
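As background on the central construction, the following is a minimal sketch of the standard Pr\"ufer sequence of a labeled tree (the classical definition, not the paper's exact AST encoding scheme): repeatedly delete the leaf with the smallest label and record the label of its neighbor, until only two nodes remain. The resulting sequence of length $n-2$ losslessly encodes the tree, which is the property the proposed representation exploits.

```python
from collections import defaultdict


def prufer_sequence(edges, n):
    """Compute the Prufer sequence of a labeled tree on nodes 1..n.

    Repeatedly remove the leaf with the smallest label and record its
    neighbor's label; the resulting sequence of n-2 labels uniquely
    (and losslessly) encodes the tree's structure.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seq = []
    for _ in range(n - 2):
        # Find the leaf (degree-1 node) with the smallest label.
        leaf = min(node for node in adj if len(adj[node]) == 1)
        neighbor = next(iter(adj[leaf]))
        seq.append(neighbor)
        # Delete the leaf from the tree.
        adj[neighbor].discard(leaf)
        del adj[leaf]
    return seq
```

For example, the star tree with center 2 and leaves 1, 3, 4 yields the sequence [2, 2], while the path 1-2-3-4 yields [2, 3]; distinct trees always yield distinct sequences, which is what makes the encoding lossless.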