Source code clones pose risks ranging from intellectual property violations to unintended vulnerabilities. Effective, efficient, and scalable clone detection, especially for diverged clones, remains challenging. Large language models (LLMs) have recently been applied to clone detection tasks, but the rapid emergence of new LLMs raises questions about optimal model selection and the potential efficacy of LLM ensembles. This paper addresses the first question by identifying 76 LLMs and filtering them down to candidates suitable for large-scale clone detection. The candidates were evaluated on two public industrial datasets, on BigCloneBench, and on a commercial large-scale dataset. No uniformly best LLM emerged, though CodeT5+110M, CuBERT, and SPTCode were top performers. Analysis of the candidate LLMs suggests that smaller embedding sizes, smaller tokenizer vocabularies, and tailored training datasets are advantageous. On the commercial large-scale dataset, the top-performing CodeT5+110M achieved 39.71\% precision, twice that of the previously used CodeBERT. To address the second question, this paper explores ensembling the selected LLMs as an effort-effective approach to improving detection effectiveness. The results underscore the importance of score normalization and favor combination methods such as maximum or sum over averaging. The findings also indicate that ensembling can yield statistically significant gains on larger datasets: the best-performing ensemble achieved 46.91\% precision on the commercial large-scale code, surpassing every individual LLM.
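The ensembling idea above can be illustrated with a minimal sketch. The function names, the choice of min-max normalization, and the sample scores below are illustrative assumptions, not the paper's actual implementation; the point is only that models emitting similarity scores on different scales must be normalized before their scores are combined by maximum, sum, or average.

```python
import numpy as np

def min_max_normalize(scores):
    """Rescale one model's clone-similarity scores to [0, 1] so that
    models with different native score ranges can be combined fairly."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:  # degenerate case: all scores identical
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

def ensemble(per_model_scores, method="max"):
    """Combine normalized similarity scores from several models.

    per_model_scores: list of score lists, one per model, aligned
    over the same candidate clone pairs.
    method: 'max', 'sum', or 'avg'.
    """
    normalized = np.stack([min_max_normalize(s) for s in per_model_scores])
    if method == "max":
        return normalized.max(axis=0)
    if method == "sum":
        return normalized.sum(axis=0)
    return normalized.mean(axis=0)

# Hypothetical similarity scores for 4 candidate clone pairs from 3 models.
model_a = [0.2, 0.9, 0.4, 0.7]
model_b = [10, 80, 30, 95]   # different scale: normalization matters here
model_c = [0.1, 0.8, 0.2, 0.6]

print(ensemble([model_a, model_b, model_c], method="max"))
```

A pair is then reported as a clone when its combined score exceeds a threshold; maximum and sum let a single confident model flag a pair, whereas averaging can dilute that signal, which matches the abstract's preference for maximum or sum.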