Code search aims to retrieve accurate code snippets based on a natural language query, in order to improve software productivity and quality. With the massive number of programs available (e.g., on GitHub or Stack Overflow), identifying and locating the right code is critical for software developers. In addition, deep learning has recently been widely applied to code-related tasks such as vulnerability detection and source code summarization. However, automated deep code search is still challenging due to the semantic gap between the program and the natural language query. Most existing deep learning-based approaches to code search rely on sequential text, i.e., they feed the program and the query to the model as flat sequences of tokens to learn program semantics, while structural information is not fully considered. Furthermore, although the widely adopted Graph Neural Networks (GNNs) have proved effective at learning program semantics, they struggle to capture global dependencies in the constructed graph, which limits model learning capacity. To address these challenges, in this paper we design a novel neural network framework, named GraphSearchNet, that enables effective and accurate source code search by jointly learning the rich semantics of both source code and natural language queries. Specifically, we construct graphs for the source code and the queries and apply a bidirectional GGNN (BiGGNN) to capture their local structural information. We then enhance BiGGNN with a multi-head attention module that supplies the global dependencies BiGGNN misses, improving model learning capacity. Extensive experiments on the Java and Python programming languages from the public benchmark CodeSearchNet confirm that GraphSearchNet outperforms current state-of-the-art works by a significant margin.
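The combination of local and global information described in the abstract can be illustrated with a toy sketch. This is not the GraphSearchNet implementation: it has no learned parameters, it replaces the gated (GRU-style) update of a GGNN with plain neighbour averaging, and it forms attention heads by slicing the feature vector. All function names, and the choice to sum the local and global outputs, are assumptions made for illustration only.

```python
import math


def bidirectional_aggregate(features, edges):
    """One round of local message passing over a directed graph.

    Each node adds to its own vector the mean of its incoming neighbours
    (forward direction) and the mean of its outgoing neighbours (backward
    direction). A real GGNN would use learned weights and a gated update.
    """
    n, dim = len(features), len(features[0])
    incoming = [[] for _ in range(n)]
    outgoing = [[] for _ in range(n)]
    for src, dst in edges:
        incoming[dst].append(src)
        outgoing[src].append(dst)

    def mean(neighbours):
        if not neighbours:
            return [0.0] * dim
        return [sum(features[j][k] for j in neighbours) / len(neighbours)
                for k in range(dim)]

    updated = []
    for i in range(n):
        fwd, bwd = mean(incoming[i]), mean(outgoing[i])
        updated.append([features[i][k] + fwd[k] + bwd[k] for k in range(dim)])
    return updated


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(features, num_heads):
    """Global step: every node attends to every node, independently per head.

    Heads are slices of the feature dimension; there are no learned
    query/key/value projections in this sketch.
    """
    n, dim = len(features), len(features[0])
    assert dim % num_heads == 0, "feature size must divide evenly into heads"
    head_dim = dim // num_heads
    output = [[0.0] * dim for _ in range(n)]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        for i in range(n):
            scores = [
                sum(features[i][k] * features[j][k] for k in range(lo, hi))
                / math.sqrt(head_dim)
                for j in range(n)
            ]
            weights = softmax(scores)
            for k in range(lo, hi):
                output[i][k] = sum(weights[j] * features[j][k] for j in range(n))
    return output


def local_then_global(features, edges, num_heads=2):
    """Local neighbour aggregation followed by global attention, summed per node."""
    local = bidirectional_aggregate(features, edges)
    global_ctx = multi_head_self_attention(local, num_heads)
    return [[a + b for a, b in zip(l_row, g_row)]
            for l_row, g_row in zip(local, global_ctx)]


# Hypothetical 3-node graph: 0 -> 1 -> 2, with 2-dimensional node features.
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
node_edges = [(0, 1), (1, 2)]
node_embeddings = local_then_global(node_features, node_edges, num_heads=2)
```

In this example node 0 and node 2 are not adjacent, so the local step alone never mixes their features; the attention step lets each node draw on every other node regardless of distance in the graph.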