RAG or Fine-tuning? A Comparative Study on LCMs-based Code Completion in Industry

Code completion, a crucial practice in industrial settings, helps developers improve programming efficiency by automatically suggesting code snippets during development. With the emergence of Large Code Models (LCMs), this field has advanced significantly. Because open-source and industrial codebases differ in coding patterns and internal dependencies, developers commonly perform domain adaptation when adopting LCMs in industry. Among the available adaptation approaches, retrieval-augmented generation (RAG) and fine-tuning (FT) are the two most popular paradigms. However, no prior research has explored the trade-offs between the two in industrial scenarios. To fill this gap, we comprehensively compare RAG and FT for industrial code completion. In collaboration with Tencent's WXG department, we collect over 160,000 internal C++ files as our codebase. Using six LCMs, we then compare the two adaptation approaches along three dimensions that matter to industrial practitioners: effectiveness, efficiency, and parameter sensitivity. Our findings reveal that RAG, when implemented with appropriate embedding models that map code snippets into dense vector representations, can achieve higher accuracy than fine-tuning alone. Specifically, BM25 presents superior retrieval effectiveness and efficiency among the studied RAG methods. Moreover, RAG and fine-tuning are orthogonal, and combining them leads to further improvement. We also observe that RAG scales better than FT, showing more sustained performance gains as the codebase grows.
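To make the retrieval step of RAG-based completion concrete, the following is a minimal sketch of BM25 retrieval feeding a completion prompt. It is not the pipeline used in the study. The toy C++ snippets, the `tokenize` helper, the prompt layout, and the parameter values `k1=1.5` and `b=0.75` are all illustrative assumptions, and the IDF formula is one commonly used non-negative variant.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for an internal codebase (not taken from the paper).
CODEBASE = [
    "int ParseUserId(const std::string& raw) { return std::stoi(raw); }",
    "void LogError(const std::string& msg) { std::cerr << msg << std::endl; }",
    "bool SendMessage(int user_id, const std::string& body) { return Dispatch(user_id, body); }",
]


def tokenize(code):
    """Split code into lowercased identifier-like tokens (a simplifying assumption)."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code.lower())


class BM25:
    """Small in-memory BM25 ranker over a list of code snippets."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1 = k1
        self.b = b
        self.tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in self.tokens]
        self.avgdl = sum(len(t) for t in self.tokens) / len(docs)
        # Document frequency: number of snippets containing each term.
        df = Counter(term for t in self.tokens for term in set(t))
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in df.items()
        }

    def score(self, query, i):
        """BM25 score of snippet i for the given query text."""
        tf, dl = self.tf[i], len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        """Return the k highest-scoring snippets for the query."""
        ranked = sorted(
            range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True
        )
        return [self.docs[i] for i in ranked[:k]]


def build_prompt(retriever, unfinished_code, k=2):
    """Prepend retrieved snippets to the unfinished code (hypothetical prompt layout)."""
    snippets = retriever.top_k(unfinished_code, k)
    context = "\n".join("// Retrieved snippet:\n" + s for s in snippets)
    return context + "\n// Complete the following code:\n" + unfinished_code


retriever = BM25(CODEBASE)
unfinished = "bool ok = SendMessage(user_id,"
prompt = build_prompt(retriever, unfinished, k=1)
print(prompt)
```

In this sketch the retriever is purely lexical, so it needs no training and no model inference at query time, which is one plausible reason such a method can be cheap to run. The resulting prompt would then be passed to an LCM, fine-tuned or not, for completion.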
