Code completion, a highly valuable task in software development, has been increasingly promoted by recent advances in large language models (LLMs). To date, prominent LLM-based code completion frameworks such as GitHub Copilot and GPT are trained with deep learning over vast quantities of unstructured text and open-source code. As a cornerstone of daily programming tasks, code completion has substantially boosted professionals' efficiency in building real-world software systems. In contrast to this flourishing market, we find that code completion models often output suspicious results, and to date, an automated testing and enhancement framework for code completion models is not available. This research proposes CCTEST, a framework to test and repair code completion systems in black-box settings. CCTEST features a novel mutation strategy, namely program structure-consistency (PSC) mutations, to generate mutated code completion inputs. It then detects inconsistent outputs, which represent likely erroneous cases, among all the completed code cases. Moreover, CCTEST repairs the code completion outputs by selecting the output that best reflects the "average" appearance of all output cases as the final output of the code completion system. We detected a total of 33,540 inputs that trigger likely erroneous cases across eight popular LLM-based code completion systems. With repair, we show that the performance of code completion models notably increases by 53.51% on average.
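To make the test-and-repair pipeline concrete, below is a minimal, hypothetical sketch of the loop the abstract describes, not the paper's actual implementation. Every name here is an assumption for illustration: `complete` stands in for any black-box completion API, `psc_mutants` applies only a toy structure-preserving mutation (prepending no-op comments) rather than the real PSC operators, `similarity` uses a simple edit-based ratio, and the `threshold` value is arbitrary.

```python
import difflib
from typing import Callable, List


def psc_mutants(prefix: str, n: int = 4) -> List[str]:
    """Generate n structure-preserving variants of a completion prefix.
    A toy stand-in for PSC mutations: prepend distinct no-op comments."""
    return [f"# variant {i}\n{prefix}" for i in range(n)]


def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity between two completions."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def test_and_repair(prefix: str, complete: Callable[[str], str],
                    threshold: float = 0.6) -> str:
    """Query the model on the original prefix and its mutants, flag
    inconsistent completions, and return the 'average' (medoid) one."""
    outputs = [complete(p) for p in [prefix, *psc_mutants(prefix)]]
    # Average similarity of each output to every other output.
    avg_sim = []
    for i, out in enumerate(outputs):
        others = [similarity(out, o) for j, o in enumerate(outputs) if j != i]
        avg_sim.append(sum(others) / len(others))
    # Outputs far from the rest are flagged as likely erroneous cases.
    flagged = [o for o, s in zip(outputs, avg_sim) if s < threshold]
    if flagged:
        print(f"flagged {len(flagged)} likely erroneous completion(s)")
    # Repair: return the completion closest, on average, to all others.
    return outputs[max(range(len(outputs)), key=lambda i: avg_sim[i])]
```

The medoid-style selection in the last line mirrors the abstract's repair strategy of choosing the output that best reflects the "average" appearance of all output cases; any other consensus measure over the candidate completions could be substituted.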