Source code clones pose risks ranging from intellectual property violations to unintended vulnerabilities. Effective, efficient, and scalable clone detection, especially for diverged clones, remains challenging. Large language models (LLMs) have recently been applied to clone detection tasks, but the rapid emergence of new LLMs raises questions about optimal model selection and the potential efficacy of LLM ensembles. This paper addresses the first question by identifying 76 LLMs and filtering them down to candidates suitable for large-scale clone detection. The candidates were evaluated on two public industrial datasets, on BigCloneBench, and on a commercial large-scale dataset. No uniformly best LLM emerged, though CodeT5+110M, CuBERT, and SPTCode were top performers. Analysis of the candidate LLMs suggests that smaller embedding sizes, smaller tokenizer vocabularies, and tailored training datasets are advantageous. On the commercial large-scale dataset, the top-performing CodeT5+110M achieved 39.71\% precision, twice that of the previously used CodeBERT. To address the second question, this paper explores ensembling the selected LLMs as an effort-effective approach to improving detection effectiveness. The results underscore the importance of score normalization and favor combination methods such as maximum or sum over averaging. The findings also indicate that ensembling can yield statistically significant gains on larger datasets: the best-performing ensemble achieved 46.91\% precision on the commercial large-scale code, surpassing every individual LLM.
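The ensembling idea above can be illustrated with a minimal sketch. The function names, the choice of min-max normalization, and the sample scores below are illustrative assumptions, not the paper's actual implementation; the point is only that models emitting similarity scores on different scales must be normalized before their scores are combined by maximum, sum, or average.

```python
import numpy as np

def min_max_normalize(scores):
    """Rescale one model's clone-similarity scores to [0, 1] so that
    models with different native score ranges can be combined fairly."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:  # degenerate case: all scores identical
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

def ensemble(per_model_scores, method="max"):
    """Combine normalized similarity scores from several models.

    per_model_scores: list of score lists, one per model, aligned
    over the same candidate clone pairs.
    method: 'max', 'sum', or 'avg'.
    """
    normalized = np.stack([min_max_normalize(s) for s in per_model_scores])
    if method == "max":
        return normalized.max(axis=0)
    if method == "sum":
        return normalized.sum(axis=0)
    return normalized.mean(axis=0)

# Hypothetical similarity scores for 4 candidate clone pairs from 3 models.
model_a = [0.2, 0.9, 0.4, 0.7]
model_b = [10, 80, 30, 95]   # different scale: normalization matters here
model_c = [0.1, 0.8, 0.2, 0.6]

print(ensemble([model_a, model_b, model_c], method="max"))
```

A pair is then reported as a clone when its combined score exceeds a threshold; maximum and sum let a single confident model flag a pair, whereas averaging can dilute that signal, which matches the abstract's preference for maximum or sum.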