Code summarization aims to generate concise natural language descriptions of source code, which can help improve program comprehension and maintenance. Recent studies show that syntactic and structural information extracted from abstract syntax trees (ASTs) is conducive to summary generation. However, existing approaches fail to fully capture the rich information in ASTs because of their large size and depth. In this paper, we propose CAST, a novel model that hierarchically splits and reconstructs ASTs. First, we hierarchically split a large AST into a set of subtrees and use a recursive neural network to encode each subtree. Then, we aggregate the subtree embeddings by reconstructing the split AST, yielding a representation of the complete AST. Finally, the AST representation, together with source code embeddings obtained from a vanilla code token encoder, is used for code summarization. Extensive experiments on benchmarks, including an ablation study and a human evaluation, demonstrate the effectiveness of CAST. To facilitate reproducibility, our code and data are available at https://anonymous.4open.science/r/CAST/.
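To make the hierarchical splitting step concrete, the sketch below detaches composite statements (functions, loops, branches) from a Python AST into separate subtrees, leaving a placeholder node in the parent so the split tree can later be reconstructed. This is a simplified illustration using Python's standard `ast` module; the split-point node types (`SPLIT_TYPES`) and the `<SUBTREE>` placeholder are assumptions for this sketch, and CAST's actual splitting rules for its target languages may differ.

```python
import ast

# Node types at which we split: composite statements become separate
# subtrees (an assumed, simplified stand-in for CAST's splitting rules).
SPLIT_TYPES = (ast.FunctionDef, ast.For, ast.While, ast.If)

def split_ast(tree):
    """Hierarchically split `tree` into a list of subtrees.

    Each composite statement is detached as its own subtree and replaced
    in its parent by a placeholder expression, so the parent records
    where the child subtree plugs back in during reconstruction.
    """
    subtrees = [tree]
    queue = [tree]
    while queue:
        node = queue.pop()
        for _field, value in ast.iter_fields(node):
            if isinstance(value, list):
                for i, child in enumerate(value):
                    if isinstance(child, SPLIT_TYPES):
                        subtrees.append(child)
                        # Placeholder marking the split point in the parent.
                        value[i] = ast.Expr(ast.Name(id="<SUBTREE>", ctx=ast.Load()))
                        queue.append(child)
                    elif isinstance(child, ast.AST):
                        queue.append(child)
            elif isinstance(value, ast.AST):
                queue.append(value)
    return subtrees

code = """
def add_all(xs):
    total = 0
    for x in xs:
        total += x
    return total
"""
trees = split_ast(ast.parse(code))
print(len(trees))  # module, function, and for-loop -> 3 subtrees
```

Each resulting subtree stays small enough for a recursive encoder, and the placeholders define the tree of subtrees along which their embeddings are aggregated back into a full-AST representation.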