The Transformer architecture and transfer learning have marked a quantum leap in natural language processing, improving the state of the art across a range of text-based tasks. This paper examines how these advancements can be applied to, and improve, code search. To this end, we pre-train a BERT-based model on combinations of natural language and source code data and fine-tune it on pairs of StackOverflow question titles and code answers. Our results show that the pre-trained models consistently outperform the models that were not pre-trained. When the model is pre-trained on both natural language and source code data, it also outperforms an information retrieval baseline based on Lucene. We further demonstrate that an information retrieval-based approach followed by a Transformer re-ranker leads to the best results overall, especially when searching in a large search pool. Transfer learning is particularly effective when much pre-training data is available and fine-tuning data is limited. We show that natural language processing models based on the Transformer architecture can be directly applied to source code analysis tasks, such as code search. With the development of Transformer models designed more specifically for dealing with source code data, we believe the results of source code analysis tasks can be further improved.
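The two-stage pipeline described above can be sketched in a few lines of Python. The sketch below uses BM25 (via the rank_bm25 package) as a Lucene-style lexical retriever to narrow the search pool, and a Hugging Face sequence-classification model to re-rank candidates by scoring (question title, code answer) pairs. The checkpoint name "my-finetuned-code-search-bert" and the use of a single relevance logit are assumptions for illustration, not the paper's exact setup.

```python
# Sketch: lexical retrieval followed by Transformer re-ranking.
import torch
from rank_bm25 import BM25Okapi
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def search(query: str, code_snippets: list[str], top_k: int = 50) -> list[str]:
    # Stage 1: BM25 retrieval over the whole pool (whitespace tokenization
    # used here only to keep the example self-contained).
    tokenized_corpus = [snippet.split() for snippet in code_snippets]
    bm25 = BM25Okapi(tokenized_corpus)
    scores = bm25.get_scores(query.split())
    candidate_ids = sorted(range(len(code_snippets)),
                           key=lambda i: scores[i], reverse=True)[:top_k]

    # Stage 2: re-rank candidates with a BERT-style cross-encoder that was
    # fine-tuned on (question title, code answer) pairs; assumed to emit a
    # single relevance logit per pair.
    tokenizer = AutoTokenizer.from_pretrained("my-finetuned-code-search-bert")
    model = AutoModelForSequenceClassification.from_pretrained(
        "my-finetuned-code-search-bert", num_labels=1)
    model.eval()

    reranked = []
    with torch.no_grad():
        for i in candidate_ids:
            inputs = tokenizer(query, code_snippets[i], truncation=True,
                               max_length=512, return_tensors="pt")
            relevance = model(**inputs).logits.squeeze().item()
            reranked.append((relevance, code_snippets[i]))

    reranked.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in reranked]
```

In this setup the cheap lexical stage keeps the expensive Transformer scoring tractable on a large search pool, which matches the observation above that the combined approach works best at scale.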