Thanks to Transformer-based pre-trained models, the performance of code search has improved significantly. However, due to the restrictions of multi-head self-attention and GPU memory, there is a limit on the input token length. Existing pretrained code models, such as GraphCodeBERT, CodeBERT, and RoBERTa (code), take only the first 256 tokens by default, which makes them unable to represent the complete information of long code (i.e., code longer than 256 tokens). Unlike a long text document, which can be regarded as a whole with continuous semantics, the semantics of long code is discontinuous, as a piece of long code may contain different code modules. Therefore, it is unreasonable to directly apply long-text processing methods to long code. To tackle the long code problem, we propose MLCS (Modeling Long Code for Code Search) to obtain a better representation for long code. Our experimental results show the effectiveness of MLCS for long code retrieval. With MLCS, Transformer-based pretrained models can be used to model long code without changing their internal structure or re-pretraining. Through AST-based splitting and attention-based fusion, MLCS achieves an overall mean reciprocal rank (MRR) score of 0.785, outperforming the previous state-of-the-art result of 0.713 on the public CodeSearchNet benchmark.
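To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of the split-then-fuse pipeline the abstract describes: code is segmented along top-level AST statements, each segment is encoded with a pretrained code encoder (here assumed to be the HuggingFace `microsoft/graphcodebert-base` checkpoint), and the segment embeddings are fused with a simple attention-pooling layer. The exact splitting granularity and fusion architecture in MLCS may differ.

```python
# Illustrative sketch only: AST-based splitting + attention-based fusion
# for representing long code with a fixed-length pretrained encoder.
import ast
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "microsoft/graphcodebert-base"  # any BERT-style code encoder works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)


def split_by_ast(source: str) -> list:
    """Split Python code into segments along top-level AST statements."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) or "" for node in tree.body]


class AttentionFusion(nn.Module):
    """Fuse per-segment embeddings into one code vector via attention pooling."""

    def __init__(self, hidden: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, seg_embs: torch.Tensor) -> torch.Tensor:
        # seg_embs: (num_segments, hidden)
        weights = torch.softmax(self.score(seg_embs), dim=0)  # (num_segments, 1)
        return (weights * seg_embs).sum(dim=0)                # (hidden,)


def embed_long_code(source: str, fusion: AttentionFusion) -> torch.Tensor:
    """Encode each AST segment within the 256-token limit, then fuse."""
    segments = split_by_ast(source) or [source]
    seg_embs = []
    for seg in segments:
        inputs = tokenizer(seg, truncation=True, max_length=256,
                           return_tensors="pt")
        with torch.no_grad():
            out = encoder(**inputs)
        # Use the [CLS] vector as the segment representation.
        seg_embs.append(out.last_hidden_state[:, 0, :].squeeze(0))
    return fusion(torch.stack(seg_embs))
```

Because the encoder is applied per segment, no change to its internal structure or re-pretraining is needed; only the lightweight fusion layer is trained for the retrieval objective.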