With the growing number of open repositories and discussion forums, natural-language queries are increasingly used for semantic code search. The accuracy of the results such systems return, however, can be low because of (1) the limited vocabulary shared between code and user queries and (2) an inadequate semantic understanding of the user query and its relation to code syntax. Siamese networks are well suited to learning such joint relations between data, but they have not been explored in the context of code search. In this work, we evaluate Siamese networks for this task by exploring multiple extraction network architectures. These extraction networks process code and text descriptions independently before passing them to a Siamese network, which learns embeddings in a common space. We experiment on two datasets and find that Siamese networks can act as strong regularizers on networks that extract rich information from code and text, which in turn helps them outperform previous baselines on code search for two programming languages. We also analyze the embedding space of these networks and provide directions for fully leveraging Siamese networks for semantic code search.
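To make the shared-embedding idea concrete, the following is a minimal sketch of Siamese-style retrieval, not the architecture evaluated in this work. Everything in it is an assumption made for illustration: `hashed_bag_of_tokens` is a toy stand-in for an extraction network, `SharedProjector` is a single randomly initialized (untrained) layer, and `rank_snippets` is a hypothetical helper. The one property it is meant to show is that the same weights embed both the query and the code, so the two can be compared by cosine similarity in one common space.

```python
import math
import random
import zlib


def hashed_bag_of_tokens(text, dim=64):
    """Toy stand-in for an extraction network: hash whitespace tokens into counts."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # zlib.crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


class SharedProjector:
    """One set of weights applied to both inputs (the 'Siamese' part).

    Assumption: a single tanh layer with random, untrained weights.
    """

    def __init__(self, in_dim=64, out_dim=16, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(in_dim)
        self.weights = [
            [rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)
        ]

    def embed(self, features):
        z = [math.tanh(sum(w * x for w, x in zip(row, features)))
             for row in self.weights]
        norm = math.sqrt(sum(v * v for v in z))
        # Normalize to unit length so a dot product equals cosine similarity.
        return [v / norm for v in z] if norm > 0 else z


def rank_snippets(query, snippets, projector):
    """Return snippet indices sorted by similarity to the query, plus raw scores."""
    q = projector.embed(hashed_bag_of_tokens(query))
    scores = [
        sum(a * b for a, b in zip(q, projector.embed(hashed_bag_of_tokens(s))))
        for s in snippets
    ]
    order = sorted(range(len(snippets)), key=lambda i: scores[i], reverse=True)
    return order, scores


if __name__ == "__main__":
    projector = SharedProjector()
    order, scores = rank_snippets(
        "read a file",
        ["def read_file(path): return open(path).read()",
         "def add(a, b): return a + b"],
        projector,
    )
    print(order, scores)
```

Because the weights here are untrained, the ranking itself carries no meaning. In a trained system the shared weights would be fit so that matching query and code pairs land close together and mismatched pairs land far apart.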