The goal of natural language semantic code search is to retrieve a semantically relevant code snippet from a fixed set of candidates using a natural language query. Existing approaches are neither effective nor efficient enough for a practical semantic code search system. In this paper, we propose an efficient and accurate semantic code search framework with cascaded fast and slow models: a fast transformer encoder model is learned to optimize a scalable index for fast retrieval, followed by a slow classification-based re-ranking model that improves the quality of the top-K results from the fast retrieval. To further reduce the high memory cost of deploying two separate models in practice, we propose to jointly train the fast and slow models on a single transformer encoder with shared parameters. The proposed cascaded approach is not only efficient and scalable, but also achieves state-of-the-art results, with an average mean reciprocal rank (MRR) of 0.7795 across six programming languages on the CodeSearchNet benchmark, compared to the previous state-of-the-art result of 0.713 MRR.
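The cascaded fast-then-slow pipeline described above can be illustrated with a minimal toy sketch. Here `fast_encode` and `slow_score` are hypothetical stand-ins for the paper's fast transformer encoder and slow classification-based re-ranker, not the actual models; only the two-stage retrieve-then-rerank control flow reflects the abstract.

```python
import numpy as np

def fast_encode(texts):
    # Hypothetical stand-in for the fast transformer encoder:
    # maps each text to a fixed-size unit embedding (toy hash-seeded vectors).
    vecs = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        v = rng.standard_normal(8)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)

def slow_score(query, snippet):
    # Hypothetical stand-in for the slow classification-based re-ranker:
    # jointly scores a (query, snippet) pair; here, simple token overlap.
    q, s = set(query.split()), set(snippet.split())
    return len(q & s) / max(len(q | s), 1)

def cascaded_search(query, corpus, k=3, top_n=1):
    # Stage 1 (fast): dense retrieval of top-k candidates by dot product.
    # In practice corpus embeddings are precomputed into a scalable index.
    corpus_emb = fast_encode(corpus)
    q_emb = fast_encode([query])[0]
    top_k = np.argsort(corpus_emb @ q_emb)[::-1][:k]
    # Stage 2 (slow): re-rank only those k candidates with the expensive model.
    reranked = sorted(top_k, key=lambda i: slow_score(query, corpus[i]),
                      reverse=True)
    return [corpus[i] for i in reranked[:top_n]]
```

The key efficiency property is that the expensive pairwise model runs only on the k candidates surviving the fast stage, so total cost stays near that of index lookup while re-ranking recovers most of the accuracy of a full pairwise scan.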