Graph neural networks have produced impressive results on a wide range of software engineering tasks. Existing techniques, however, still have two shortcomings: (1) they struggle to capture long-term dependencies, and (2) they treat different code components as equals when they should not. To address these issues, we propose a method for representing code as a hierarchy (Code Hierarchy), in which different code components are represented separately at various levels of granularity. To process each level of representation, we design a novel network architecture, ECHELON, which combines the strengths of Heterogeneous Graph Transformer Networks and Tree-based Convolutional Neural Networks to learn Abstract Syntax Trees enriched with code dependency information. We also propose a novel pretraining objective, Missing Subtree Prediction, to complement our Code Hierarchy. The evaluation results show that our method significantly outperforms the baselines on three tasks: any-code completion, code classification, and code clone detection.
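To make the idea behind a missing-subtree objective concrete, the following is a minimal sketch of how one training pair (a masked tree and the removed subtree) could be built from an Abstract Syntax Tree. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how subtrees are selected or how targets are encoded, so the statement-level masking, the helper name `mask_subtree`, and the placeholder identifier `__MISSING__` are all hypothetical. It uses only Python's standard `ast` module (Python 3.9+ for `ast.unparse`).

```python
import ast
import copy

# Hypothetical placeholder marking where a subtree was removed.
MASK_NAME = "__MISSING__"


def mask_subtree(source: str, stmt_index: int):
    """Remove one statement-level subtree from a function body.

    Assumes `source` contains a single function definition. Returns
    (masked_source, target_source): the code with the chosen statement
    replaced by a placeholder, and the removed statement itself.
    """
    tree = ast.parse(source)
    func = tree.body[0]
    target = func.body[stmt_index]  # subtree a model would have to predict

    masked = copy.deepcopy(tree)  # leave the original tree untouched
    placeholder = ast.Expr(value=ast.Name(id=MASK_NAME, ctx=ast.Load()))
    masked.body[0].body[stmt_index] = placeholder
    ast.fix_missing_locations(masked)

    return ast.unparse(masked), ast.unparse(target)


SNIPPET = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""

# Mask the loop (the second statement in the function body).
masked_src, target_src = mask_subtree(SNIPPET, 1)
print(masked_src)
print("--- target subtree ---")
print(target_src)
```

In a real pretraining setup the masked tree would be fed to the encoder and the target subtree would serve as the prediction label; how that is done in ECHELON is not described in this abstract.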