An Empirical Study on the Usage of BERT Models for Code Completion

Code completion is one of the main features of modern Integrated Development Environments (IDEs). Its objective is to speed up code writing by predicting the next code token(s) the developer is likely to write. Research in this area has substantially improved the predictive performance of these techniques. However, support for developers is still limited to predicting the next few tokens to type. In this work, we take a step further by presenting a large-scale empirical study that explores the capabilities of state-of-the-art deep learning (DL) models in supporting code completion at different granularity levels: single tokens, one or multiple entire statements, and entire code blocks (e.g., the iterated block of a for loop). To this aim, we train and test several adapted variants of the recently proposed RoBERTa model and evaluate their predictions from several perspectives, including: (i) metrics usually adopted when assessing DL generative models (i.e., BLEU score and Levenshtein distance); (ii) the percentage of perfect predictions (i.e., predicted code snippets that match those written by developers); and (iii) the "semantic" equivalence of the generated code as compared to the code written by developers. The results show that BERT models represent a viable solution for code completion, with perfect predictions ranging from ~7%, obtained when asking the model to guess entire blocks, up to ~58%, reached in the simpler scenario of a few tokens masked from the same code statement.
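To make the evaluation setup in the abstract more concrete, the following is a minimal sketch of span masking and two of the named measures: Levenshtein distance and the perfect-prediction rate. It is not the authors' implementation. The whitespace tokenization, the `<MASK>` placeholder, the choice to compute the distance over tokens rather than characters, and the helper names `mask_span`, `levenshtein`, and `perfect_prediction_rate` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def mask_span(tokens: List[str], start: int, end: int,
              mask: str = "<MASK>") -> Tuple[List[str], List[str]]:
    """Replace tokens[start:end] with one mask token (hypothetical scheme).

    Returns the masked sequence and the hidden target tokens.
    """
    return tokens[:start] + [mask] + tokens[end:], tokens[start:end]


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def perfect_prediction_rate(preds: Sequence[Sequence[str]],
                            refs: Sequence[Sequence[str]]) -> float:
    """Fraction of predictions that exactly match the reference tokens."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not refs:
        return 0.0
    hits = sum(list(p) == list(r) for p, r in zip(preds, refs))
    return hits / len(refs)


# Example: hide the statement inside a for loop (assumed whitespace tokens).
code = "for ( int i = 0 ; i < n ; i ++ ) { sum += i ; }".split()
masked, target = mask_span(code, 15, 19)

# A hypothetical model output that gets one token wrong.
prediction = ["sum", "+=", "1", ";"]
distance = levenshtein(prediction, target)
rate = perfect_prediction_rate([prediction, target], [target, target])
```

In this example the target is the single statement `sum += i ;`, the hypothetical prediction differs from it by one substituted token, and one of the two scored predictions is an exact match. BLEU is omitted here because it is usually computed with an existing library rather than by hand.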


