Although pre-trained language models (LMs) for code have achieved great success in code completion, they generate code conditioned only on the contents of the current file (the in-file context) and ignore the rich semantics in other files of the same project (the cross-file context), a critical source of information in modern modular software development. This omission limits their code completion ability and leads to errors such as hallucinated class member functions or function calls with unexpected arguments. In this work, we develop CCFINDER, a cross-file context finder tool that effectively locates and retrieves the most relevant cross-file context. We also propose CoCoMIC, a framework built on top of pre-trained code LMs that incorporates cross-file context and learns the in-file and cross-file context jointly. When cross-file context is provided, CoCoMIC improves the existing code LM by a relative 19.30% in exact match and a relative 15.41% in identifier matching for code completion.
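To make the notion of cross-file context concrete, the following is a minimal sketch and not CCFINDER itself: it assumes a toy project held in a dictionary, resolves only project-local Python imports, and prepends the retrieved signatures to the prompt as comments. The names `top_level_signatures`, `find_cross_file_context`, and the `project` dictionary are hypothetical, and plain prompt prepending is a simplification that stands in for, and should not be confused with, the joint in-file and cross-file modeling that CoCoMIC performs.

```python
import ast


def top_level_signatures(source):
    """Map each top-level function or class name to a one-line summary."""
    sigs = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            sigs[node.name] = f"def {node.name}({args})"
        elif isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
            sigs[node.name] = f"class {node.name}: methods={methods}"
    return sigs


def find_cross_file_context(project, current):
    """Collect summaries of project-local symbols imported by `current`.

    `project` is a hypothetical mapping from module name to source text.
    Imports of modules outside the project are ignored.
    """
    context = []
    for node in ast.walk(ast.parse(project[current])):
        if isinstance(node, ast.ImportFrom) and node.module in project:
            # "from module import name": keep only the imported names.
            sigs = top_level_signatures(project[node.module])
            for alias in node.names:
                if alias.name in sigs:
                    context.append(f"# {node.module}: {sigs[alias.name]}")
        elif isinstance(node, ast.Import):
            # "import module": keep every top-level symbol of that module.
            for alias in node.names:
                if alias.name in project:
                    sigs = top_level_signatures(project[alias.name])
                    context.extend(f"# {alias.name}: {s}" for s in sigs.values())
    return context


# Toy two-file project: "main" uses a class defined in "shapes".
project = {
    "shapes": (
        "class Circle:\n"
        "    def __init__(self, r):\n"
        "        self.r = r\n"
        "    def area(self):\n"
        "        return 3.14159 * self.r ** 2\n"
    ),
    "main": "from shapes import Circle\nc = Circle(2)\n",
}

context = find_cross_file_context(project, "main")
# Cross-file summaries first, then the in-file context.
prompt = "\n".join(context) + "\n" + project["main"]
print(prompt)
```

With the summary of `Circle` in the prompt, a completion model can see that `area` is a real member function, which is the kind of information whose absence leads to hallucinated members when only the in-file context is available.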