We address contextualized code retrieval, the search for code snippets that help fill gaps in a partial input program. Our approach enables large-scale self-supervised contrastive training by randomly splitting source code into contexts and targets. To combat leakage between the two, we propose a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for the direct evaluation of contextualized code retrieval, derived from a dataset of manually aligned subpassages of code clones. Our experiments demonstrate that our approach improves retrieval substantially and yields new state-of-the-art results for code clone and defect detection.
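To make the context/target split concrete, the following is a minimal sketch in Python rather than the authors' implementation. It assumes that a syntax-aligned target can be approximated by one whole statement taken from Python's `ast` module, that dedentation means stripping the common leading whitespace, and that mutual identifier masking can be approximated by replacing identifiers shared by context and target with a placeholder on the target side only. The names `split_context_target`, `identifiers`, `<mask>`, and `<gap>` are hypothetical and do not come from the abstract.

```python
import ast
import keyword
import random
import re
import textwrap

IDENT = re.compile(r"[A-Za-z_]\w*")
MASK = "<mask>"  # hypothetical placeholder for a masked identifier
GAP = "<gap>"    # hypothetical marker for the removed span in the context


def identifiers(code):
    """Rough identifier set: every word-like token that is not a keyword."""
    return {w for w in IDENT.findall(code) if not keyword.iskeyword(w)}


def split_context_target(source, rng):
    """Split `source` into (context, target) for one training pair.

    Assumption: the target is a single, randomly chosen statement, so it
    is aligned with the syntax tree instead of being an arbitrary line span.
    """
    lines = source.splitlines()
    statements = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.stmt)]
    node = rng.choice(statements)
    start, end = node.lineno - 1, node.end_lineno  # end_lineno needs Python 3.8+

    # Dedentation: remove indentation that would reveal the nesting depth.
    target = textwrap.dedent("\n".join(lines[start:end]))
    rest = lines[:start] + lines[end:]

    # Identifier masking (assumed variant): hide names that occur on both
    # sides, so the pair cannot be matched by shared identifiers alone.
    shared = identifiers("\n".join(rest)) & identifiers(target)
    target = IDENT.sub(lambda m: MASK if m.group() in shared else m.group(), target)

    context = "\n".join(lines[:start] + [GAP] + lines[end:])
    return context, target


SOURCE = """\
def total_price(items, tax):
    subtotal = 0
    for item in items:
        subtotal += item.price
    return subtotal * (1 + tax)
"""

context, target = split_context_target(SOURCE, random.Random(0))
print(context)
print("-----")
print(target)
```

In a contrastive setup, each `(context, target)` pair would act as a positive example, with targets from other programs as negatives. The regex-based identifier extraction also matches words inside strings and comments, so a real pipeline would more likely use a proper tokenizer.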