The Transformer architecture and transfer learning have marked a quantum leap in natural language processing, improving the state of the art across a range of text-based tasks. This paper examines how these advancements can be applied to and improve code search. To this end, we pre-train a BERT-based model on combinations of natural language and source code data and evaluate it on pairs of StackOverflow question titles and code answers. Our results show that the pre-trained models consistently outperform the models that were not pre-trained. In cases where the model was pre-trained on both natural language and source code data, it also outperforms an information retrieval baseline based on Lucene. We further demonstrate that combining an information retrieval-based approach with a subsequent Transformer re-ranking step leads to the best results overall, especially when searching over a large search pool. Furthermore, transfer learning is particularly effective when much pre-training data is available and fine-tuning data is limited. We demonstrate that natural language processing models based on the Transformer architecture can be directly applied to source code analysis tasks, such as code search. With the development of Transformer models designed more specifically for dealing with source code data, we believe the results on source code analysis tasks can be further improved.
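To make the combined retrieval-then-re-ranking idea concrete, the following is a minimal sketch of such a two-stage pipeline. It is illustrative only, not the paper's implementation: BM25 (via the rank_bm25 package) stands in for the Lucene-based baseline, the model name "bert-base-uncased", the mean-pooling strategy, and the candidate-pool sizes are all assumptions chosen for brevity.

```python
# Two-stage code search sketch: a lexical pre-filter (BM25, standing in for the
# Lucene baseline) narrows the search pool, then a BERT-based encoder re-ranks
# the surviving candidates. Model choice and pooling are illustrative assumptions.
import torch
from rank_bm25 import BM25Okapi
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # placeholder; the paper pre-trains its own BERT variant
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


def embed(text: str) -> torch.Tensor:
    """Encode a query or code snippet and mean-pool the last hidden states."""
    inputs = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)


def search(query: str, snippets: list[str], prefilter_k: int = 50, top_k: int = 5) -> list[str]:
    # Stage 1: cheap lexical retrieval over the full pool.
    bm25 = BM25Okapi([s.split() for s in snippets])
    scores = bm25.get_scores(query.split())
    candidates = sorted(range(len(snippets)), key=lambda i: scores[i], reverse=True)[:prefilter_k]

    # Stage 2: Transformer re-ranking of the shortlist by embedding similarity.
    q_vec = embed(query)
    reranked = sorted(
        candidates,
        key=lambda i: torch.cosine_similarity(q_vec, embed(snippets[i]), dim=0).item(),
        reverse=True,
    )
    return [snippets[i] for i in reranked[:top_k]]
```

In this setup the expensive Transformer pass only touches the prefilter_k candidates returned by the lexical stage, which is what makes the combined approach attractive when the search pool is large.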