Efficiently representing source code is essential for software engineering tasks such as code classification and code clone detection. Existing approaches primarily use the abstract syntax tree (AST); only a few works focus on semantic graphs such as the control flow graph (CFG) and the program dependence graph (PDG), which contain essential information about source code that the AST lacks. Although some works have tried to utilize multiple representations, they offer no insight into the costs and benefits of using multiple representations versus a single representation appropriate for the task. Moreover, they rely on hand-crafted program features to solve a specific task, which limits their use cases. The primary goal of this paper is to discuss the implications of utilizing multiple code representations, specifically AST, CFG, and PDG, and how each of them affects task performance. To this end, we use an approach that draws program features from multiple code graphs without being coupled to a specific task or language. Our approach stems from the idea of modeling the AST as a set of paths and using a learning model to capture program properties. We modify an existing AST path-based approach to accept multiple code representations as input, which allows us to measure the performance boost that additional representations provide over the AST alone. We evaluate our approach on three tasks: Method Naming, Program Classification, and Code Clone Detection. Our approach improves performance on these three tasks by 11% (F1), 15.7% (Accuracy), and 9.3% (F1), respectively, over the baseline. We discuss the impact of semantic features from the CFG and PDG paths on performance, as well as the additional overhead our approach incurs. We envision this work providing researchers with a lens to evaluate combinations of source code representations for various tasks.
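To make the idea of modeling an AST as a set of paths concrete, the following is a minimal sketch, not the implementation used in this work. It assumes Python's standard `ast` module as the parser and extracts leaf-to-leaf paths through the lowest common ancestor of each leaf pair. The function names (`leaf_label`, `leaf_paths`, `path_contexts`), the `max_length` cutoff, and the `^`/`_` path notation are all illustrative assumptions. Paths over CFG or PDG edges would need a separate graph builder, which this sketch does not cover.

```python
import ast
from itertools import combinations


def leaf_label(node):
    """Pick a readable token for a leaf node (illustrative choice)."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return type(node).__name__


def leaf_paths(tree):
    """Return every root-to-leaf path as a list of AST nodes."""
    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        # Skip Load/Store/Del context markers so identifiers count as leaves.
        children = [c for c in ast.iter_child_nodes(node)
                    if not isinstance(c, ast.expr_context)]
        if not children:
            paths.append(prefix)
        for child in children:
            walk(child, prefix)

    walk(tree, [])
    return paths


def path_contexts(source, max_length=8):
    """Return (leaf, path, leaf) triples for all leaf pairs in `source`.

    `max_length` (an assumed hyperparameter) bounds the number of nodes
    on a path, counting both leaves and the common ancestor.
    """
    paths = leaf_paths(ast.parse(source))
    contexts = []
    for a, b in combinations(paths, 2):
        # Length of the shared prefix; a[k - 1] is the lowest common ancestor.
        k = 0
        while k < min(len(a), len(b)) and a[k] is b[k]:
            k += 1
        up = [type(n).__name__ for n in reversed(a[k - 1:])]
        down = [type(n).__name__ for n in b[k:]]
        if len(up) + len(down) > max_length:
            continue
        # "^" marks upward steps, "_" marks downward steps.
        path = "^".join(up) + "".join("_" + d for d in down)
        contexts.append((leaf_label(a[-1]), path, leaf_label(b[-1])))
    return contexts


if __name__ == "__main__":
    for context in path_contexts("def add(x, y):\n    return x + y\n"):
        print(context)
```

Each triple pairs two leaf tokens with the syntactic path connecting them. A learning model can then embed and aggregate such triples, and paths drawn from other code graphs could in principle be added to the same set.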