The application of deep learning techniques in software engineering is becoming increasingly popular. One key problem is developing high-quality, easy-to-use source code representations for code-related tasks. The research community has achieved impressive results in recent years. However, because of deployment difficulties and performance bottlenecks, these approaches are seldom applied in industry. In this paper, we present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation, with the aim of bringing this technique into industrial practice. The proposed xASTNN has three advantages. First, xASTNN is based entirely on widely used ASTs and does not require complicated data pre-processing, which makes it applicable to various programming languages and practical scenarios. Second, three closely related designs are proposed to ensure the effectiveness of xASTNN: a statement subtree sequence for code naturalness, a gated recursive unit for syntactical information, and a gated recurrent unit for sequential information. Third, a dynamic batching algorithm is introduced to significantly reduce the time complexity of xASTNN. We evaluate xASTNN on two downstream code comprehension tasks, code classification and code clone detection. The results demonstrate that xASTNN improves on the state of the art while running faster than the baselines.
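To make the idea of a statement subtree sequence concrete, the following is a minimal sketch that uses Python's standard `ast` module to list the statement-level nodes of a program in source order. It is only an illustration under stated assumptions: the abstract does not specify the splitting rules xASTNN uses, and the function name `statement_subtrees` and the choice of "every `ast.stmt` node, in pre-order" are hypothetical simplifications, not the authors' method.

```python
import ast
from typing import List


def statement_subtrees(source: str) -> List[ast.stmt]:
    """Collect statement nodes in pre-order, which follows source order.

    Assumption: a "statement subtree" is approximated here by any
    ast.stmt node; the paper's actual splitting rules may differ.
    """
    tree = ast.parse(source)
    found: List[ast.stmt] = []

    def visit(node: ast.AST) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.stmt):
                found.append(child)
            # Keep descending so nested statements are found as well.
            visit(child)

    visit(tree)
    return found


code = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
names = [type(stmt).__name__ for stmt in statement_subtrees(code)]
print(names)  # ['FunctionDef', 'If', 'Return', 'Return']
```

In a model of the kind the abstract describes, each element of such a sequence would presumably be encoded by the recursive unit, and the resulting sequence of vectors passed through the recurrent unit; that neural part is omitted here.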