Program classification can be regarded as a high-level abstraction of code. It lays a foundation for various source code comprehension tasks and has a wide range of applications in software engineering, such as code clone detection, code smell classification, and defect classification. Cross-language program classification can support code transfer across programming languages and promote cross-language code reuse, thereby helping developers write code quickly and reducing the development time of code transfer. Most existing studies focus on semantic learning of code, whilst few are devoted to cross-language tasks. The main challenge of cross-language program classification is how to extract semantic features from different programming languages. To address this challenge, we propose a Unified Abstract Syntax Tree (UAST) neural network. The core idea of UAST consists of two unified mechanisms. First, UAST learns an AST representation by unifying the AST traversal sequence and the graph-like AST structure in order to capture semantic code features. Second, we construct a mechanism called unified vocabulary, which reduces the feature gap between different programming languages and thereby enables cross-language program classification. In addition, we collect a dataset containing 20,000 files in five programming languages, which can serve as a benchmark dataset for the cross-language program classification task. Experiments on two datasets show that our proposed approach outperforms the state-of-the-art baselines in terms of four evaluation metrics (Precision, Recall, F1-score, and Accuracy).
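To make the two mechanisms more concrete, the following is a minimal sketch, not the UAST implementation itself. It assumes Python's standard `ast` module as the parser, and it uses a hypothetical `UNIFIED_VOCAB` table whose entries (including the Java-style node names) are invented for illustration; the abstract does not specify how the actual unified vocabulary is built. The sketch derives the two views of one AST that the first mechanism combines (a pre-order traversal sequence and a parent-to-child edge list) and maps node types to shared tokens in the spirit of the second mechanism.

```python
import ast

# Hypothetical unified vocabulary: language-specific node names mapped to
# shared tokens. These entries are illustrative assumptions only.
UNIFIED_VOCAB = {
    "FunctionDef": "FUNC_DECL",        # Python `ast` node name
    "MethodDeclaration": "FUNC_DECL",  # Java-style node name (illustrative)
    "For": "LOOP",
    "While": "LOOP",
    "ForStatement": "LOOP",
    "If": "BRANCH",
    "IfStatement": "BRANCH",
    "Return": "RETURN",
    "ReturnStatement": "RETURN",
}


def unify(node_type):
    """Map a node type to its shared token; unknown types pass through."""
    return UNIFIED_VOCAB.get(node_type, node_type)


def ast_views(source):
    """Return two views of the AST of a Python source string.

    sequence: unified node tokens in pre-order traversal order
    edges:    (parent_index, child_index) pairs indexing into `sequence`
    """
    tree = ast.parse(source)
    sequence = []
    edges = []

    def visit(node, parent_index):
        index = len(sequence)
        sequence.append(unify(type(node).__name__))
        if parent_index is not None:
            edges.append((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(tree, None)
    return sequence, edges


if __name__ == "__main__":
    seq, edg = ast_views("def f(x):\n    return x\n")
    print(seq)
    print(edg)
```

In a learned model, the sequence view would typically feed a sequential encoder and the edge list a graph encoder. How UAST fuses the two is not described in the abstract, so that step is deliberately left out here.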