Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) on billions of code tokens and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be highly project-specific: vocabulary and other characteristics vary substantially from one project to another. Thus, training on project-specific data and testing on the same project is a promising idea. This hypothesis must be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model specifically designed to be sample-efficient (and thus prima facie well suited to learning from limited same-project data), and a maximalist hybrid approach that first fine-tunes on many projects in many languages and then trains on the same project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state of the art, on many different projects in both Java and Python.
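To make the time-series evaluation setting concrete, the sketch below shows one way a same-project dataset could be split chronologically so that every training example predates every test example, preventing training-test leakage. This is an illustrative assumption, not the paper's actual pipeline; the Example fields and the 80/20 cut-off are hypothetical.

```python
# Illustrative sketch (assumed, not the paper's code): a time-ordered
# same-project split, assuming each labeled example carries the timestamp
# of the commit it was drawn from.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    code: str          # e.g., a method body or code fragment (hypothetical field)
    label: str         # task-specific label (hypothetical field)
    timestamp: float   # commit time, seconds since epoch (hypothetical field)


def time_ordered_split(
    examples: List[Example], train_frac: float = 0.8
) -> Tuple[List[Example], List[Example]]:
    """Split one project's examples chronologically: the earliest
    train_frac of examples are used for fine-tuning, the remainder
    for testing, so no test example precedes a training example."""
    ordered = sorted(examples, key=lambda ex: ex.timestamp)
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```

Under this kind of split, the maximalist hybrid approach would first fine-tune on a large multi-project, multi-language corpus and then continue fine-tuning on the chronologically earlier portion of the target project before evaluating on the later portion.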