Much of software-engineering research relies on the naturalness of code: the fact that code, even in small snippets, is repetitive and can be predicted using statistical language models such as n-grams. Although powerful, training such models on a large code corpus is tedious, time-consuming, and sensitive to the code patterns (and practices) encountered during training. Consequently, these models are often trained on small corpora and estimate a language naturalness that is relative to a specific programming style or type of project. To overcome these issues, we propose using pre-trained language models to infer code naturalness. Pre-trained models are often built on big data, are easy to use out of the box, and include powerful association-learning mechanisms. Our key idea is to quantify code naturalness through its predictability, using state-of-the-art generative pre-trained language models. Concretely, we infer naturalness by masking (omitting) the tokens of code sequences, one at a time, and checking the model's ability to predict them. We evaluate three different predictability metrics: (a) the number of exact matches between the predictions and the original tokens, (b) the embedding similarity between the original and predicted code, i.e., their similarity in the vector space, and (c) the confidence of the model when performing the token-completion task, irrespective of its outcome. We implement this workflow in a tool named CodeBERT-nt and evaluate its capability to prioritize buggy lines over non-buggy ones when ranking code by its naturalness. Our results, on 2,510 buggy versions of 40 projects from the SmartShark dataset, show that CodeBERT-nt outperforms both random-uniform and complexity-based ranking techniques, and yields results comparable to (slightly better than) those of n-gram models.
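To make the masking workflow concrete, the following is a minimal sketch of the per-token predictability measurement, assuming the HuggingFace transformers library and the publicly available microsoft/codebert-base-mlm checkpoint. The aggregation shown (a simple mean over tokens, covering only metrics (a) and (c)) is an illustrative simplification, not the actual CodeBERT-nt implementation.

```python
# Minimal sketch: mask each token of a code line, one at a time, and
# record whether a pre-trained masked language model recovers it and
# with what confidence. Assumes the "transformers" library and the
# public "microsoft/codebert-base-mlm" checkpoint; metric (b),
# embedding similarity, is omitted here for brevity.
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
model.eval()

def line_predictability(code_line: str):
    """Return (exact-match rate, mean confidence) over the line's tokens."""
    ids = tokenizer(code_line, return_tensors="pt")["input_ids"][0]
    exact, confidence = [], []
    # Skip the special <s> and </s> tokens at the sequence boundaries.
    for pos in range(1, len(ids) - 1):
        masked = ids.clone()
        original = masked[pos].item()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        probs = torch.softmax(logits, dim=-1)
        predicted = int(probs.argmax())
        exact.append(predicted == original)          # metric (a): exact match
        confidence.append(probs[predicted].item())   # metric (c): confidence
    n = len(exact)
    return sum(exact) / n, sum(confidence) / n

# Lines with lower predictability scores would be ranked as less
# "natural", and hence more suspicious, than higher-scoring ones.
print(line_predictability("int i = 0;"))
```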