Deep learning models have been successfully applied to a variety of software engineering tasks, such as code classification, summarisation, and bug and vulnerability detection. In order to apply deep learning to these tasks, source code needs to be represented in a format suitable for input into the deep learning model. Most approaches to representing source code, such as token sequences, abstract syntax trees (ASTs), data flow graphs (DFGs), and control flow graphs (CFGs), focus only on the code itself and do not take into account additional context that could be useful for deep learning models. In this paper, we argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed. We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model on two software engineering tasks. We outline our research agenda for adding further contextual information to source code representations for deep learning.