Efficient representation of source code is essential for software engineering tasks such as code search and code clone detection. One technique for representing source code extracts paths from the abstract syntax tree (AST) and uses a learning model to capture program properties. Code2vec is a commonly used path-based approach that uses an attention-based neural network to learn code embeddings, which can then be used for various software engineering tasks. However, code2vec uses only ASTs and does not leverage other graph structures such as Control Flow Graphs (CFG) and Program Dependency Graphs (PDG). Similarly, most recent approaches for representing source code still rely on the AST and do not leverage semantic graph structures. Although an integrated graph approach, the Code Property Graph, exists for representing source code, it has been explored only in the domain of software security, and it does not leverage the paths from the individual graphs. In our work, we extend the path-based approach code2vec to include the semantic graphs CFG and PDG along with the AST, a combination that is still largely unexplored in software engineering. We evaluate our approach on the method naming task using a custom C dataset of 730K methods collected from 16 C projects on GitHub. Compared to code2vec, our approach improves the F1 score by 11% on the full dataset and by up to 100% on individual projects. We show that semantic features from the CFG and PDG paths are indeed helpful. We envision that looking at a mocktail of source code representations for various software engineering tasks can lay the foundation for a new line of research and an overhaul of existing research.
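To make the idea of path extraction concrete, the following is a minimal sketch of leaf-to-leaf path enumeration over a toy syntax tree. It is an illustration only and not the implementation evaluated in this work: the tuple-based tree encoding, the function names `leaf_paths` and `path_contexts`, and the ` ^ ` / ` v ` separators for upward and downward moves are all assumptions made for the example. It covers AST paths only; CFG and PDG paths are not modeled here.

```python
from itertools import combinations

# Hypothetical toy encoding: a node is (label, children); a leaf has no
# children and its label is the token. This tree stands for `x = y + 1`.
TREE = ("Assign", [
    ("Name", [("x", [])]),
    ("BinOp", [
        ("Name", [("y", [])]),
        ("Num", [("1", [])]),
    ]),
])

def leaf_paths(node, prefix=(), index=0):
    """Yield the root-to-leaf path of every leaf as (child_index, label) pairs."""
    label, children = node
    here = prefix + ((index, label),)
    if not children:
        yield here
        return
    for i, child in enumerate(children):
        yield from leaf_paths(child, here, i)

def path_contexts(root):
    """Return (start_token, path, end_token) for every pair of leaves."""
    contexts = []
    for a, b in combinations(list(leaf_paths(root)), 2):
        # Walk the shared prefix; two distinct leaves always diverge
        # before either root-to-leaf path ends.
        k = 0
        while a[k] == b[k]:
            k += 1
        up = [label for _, label in reversed(a[k:-1])]  # leaf's parent up to below the LCA
        lca = a[k - 1][1]                               # lowest common ancestor
        down = [label for _, label in b[k:-1]]          # below the LCA down to leaf's parent
        path = " ^ ".join(up + [lca]) + "".join(" v " + d for d in down)
        contexts.append((a[-1][1], path, b[-1][1]))
    return contexts

for context in path_contexts(TREE):
    print(context)
```

Each resulting triple pairs two tokens with the syntactic path that connects them, which is the kind of input a path-based model consumes. Child indices are tracked alongside labels so that sibling nodes with the same label are not mistaken for a shared ancestor.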