Efficiently representing source code is crucial for software engineering tasks such as code classification and clone detection. Existing approaches primarily use the Abstract Syntax Tree (AST), and only a few focus on semantic graphs such as the Control Flow Graph (CFG) and Program Dependency Graph (PDG), which capture information about source code that the AST does not. Although some works have tried to utilize multiple representations, they offer no insight into the costs and benefits of doing so. The primary goal of this paper is to discuss the implications of utilizing multiple code representations, specifically AST, CFG, and PDG. To measure the impact of adding representations such as CFG and PDG to AST, we modify an AST path-based approach so that it accepts multiple representations as input to an attention-based model. We evaluate our approach on three tasks: Method Naming, Program Classification, and Clone Detection. Our approach improves performance on these tasks by 11% (F1), 15.7% (Accuracy), and 9.3% (F1), respectively, over the baseline. Beyond the effect on performance, we discuss the timing overheads incurred by using multiple representations. We envision this work providing researchers with a lens to evaluate combinations of code representations for various tasks.