GitHub commits, which record code changes together with natural language messages describing them, play a critical role in helping software developers comprehend software evolution. To promote the development of the open-source software community, we collect a commit benchmark containing over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a large pre-trained encoder-decoder Transformer model for GitHub commits. The model is pre-trained with six pre-training tasks spanning three categories (i.e., denoising objectives, cross-modal generation, and contrastive learning) to learn commit fragment representations. Furthermore, we unify a "commit intelligence" framework comprising one understanding task and three generation tasks for commits. Comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task enhances the model's performance. We encourage follow-up researchers to contribute more commit-related downstream tasks to our framework in the future.