Code search aims to retrieve accurate code fragments for a natural language query, improving software productivity and quality. However, automated deep code search remains challenging because of the semantic gap between programs and natural language queries. Most existing deep learning-based approaches to code search rely on sequential text, i.e., they feed the program and the query to the model as flat token sequences to learn program semantics, so the structural information of both the program and the query is not fully considered. Furthermore, although the widely adopted Graph Neural Networks (GNNs) have proved effective at learning program semantics, they struggle to capture the global dependency between arbitrary pairs of nodes in the constructed graph, which limits the model's learning capacity. In this paper, to address these challenges, we design a novel neural network framework, named GraphSearchNet, that enables effective and accurate source code search by jointly learning the rich semantics of both source code and natural language queries. Specifically, we propose to encode source code and queries into two separate graphs with a Bidirectional GGNN (BiGGNN) to capture the local structural information of the programs and queries. We further enhance BiGGNN with a multi-head attention mechanism that supplements the global dependencies BiGGNN misses, improving the model's learning capacity. Extensive experiments on the Java and Python datasets from a public benchmark show that GraphSearchNet outperforms current state-of-the-art works by a significant margin. We further conduct a quantitative analysis based on real queries to illustrate the effectiveness of our approach.
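The idea of pairing local, bidirectional message passing with global attention can be illustrated with a small sketch. This is not the GraphSearchNet implementation: the propagation step below uses plain neighbour averaging instead of gated (GRU) updates, the attention has no learned projections, and the choices to sum the two views, mean-pool the nodes, and rank by cosine similarity are assumptions made for illustration. All function names are hypothetical.

```python
import math


def bidirectional_propagate(h, edges, hops=1):
    """Simplified stand-in for BiGGNN (assumption: averaging replaces the
    gated update). Each node fuses its own state with the mean of messages
    received along edge direction and against edge direction."""
    n, d = len(h), len(h[0])
    for _ in range(hops):
        fwd = [[0.0] * d for _ in range(n)]  # messages src -> dst
        bwd = [[0.0] * d for _ in range(n)]  # messages dst -> src
        fwd_cnt, bwd_cnt = [0] * n, [0] * n
        for src, dst in edges:
            for k in range(d):
                fwd[dst][k] += h[src][k]
                bwd[src][k] += h[dst][k]
            fwd_cnt[dst] += 1
            bwd_cnt[src] += 1
        new_h = []
        for i in range(n):
            row = []
            for k in range(d):
                f = fwd[i][k] / fwd_cnt[i] if fwd_cnt[i] else 0.0
                b = bwd[i][k] / bwd_cnt[i] if bwd_cnt[i] else 0.0
                row.append((h[i][k] + f + b) / 3.0)
            new_h.append(row)
        h = new_h
    return h


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(h, num_heads=2):
    """Parameter-free multi-head self-attention: features are split into
    heads, and every node attends to every node, connected or not."""
    n, d = len(h), len(h[0])
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    hd = d // num_heads
    out = [[0.0] * d for _ in range(n)]
    for head in range(num_heads):
        lo, hi = head * hd, (head + 1) * hd
        for i in range(n):
            scores = [
                sum(h[i][k] * h[j][k] for k in range(lo, hi)) / math.sqrt(hd)
                for j in range(n)
            ]
            weights = softmax(scores)
            for k in range(lo, hi):
                out[i][k] = sum(weights[j] * h[j][k] for j in range(n))
    return out


def encode_graph(h, edges, hops=1, num_heads=2):
    """Assumed fusion: sum the local and global node states, then mean-pool."""
    n, d = len(h), len(h[0])
    local = bidirectional_propagate(h, edges, hops)
    global_view = multi_head_self_attention(h, num_heads)
    fused = [[local[i][k] + global_view[i][k] for k in range(d)] for i in range(n)]
    return [sum(fused[i][k] for i in range(n)) / n for k in range(d)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy inputs: a 3-node chain for the "program" and a 2-node "query" graph.
code_nodes = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
code_edges = [(0, 1), (1, 2)]
query_nodes = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
query_edges = [(0, 1)]

code_vec = encode_graph(code_nodes, code_edges)
query_vec = encode_graph(query_nodes, query_edges)
score = cosine(code_vec, query_vec)  # higher score = better match
```

In the chain graph, nodes 0 and 2 are two hops apart, so a single propagation step leaves node 0 unaware of node 2, while the attention step lets node 0 see it directly. That is the gap the abstract describes multi-head attention as filling.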