GitHub commits, which record code changes together with natural language messages describing them, play a critical role in helping software developers comprehend software evolution. To promote the development of the open-source software community, we collect a commit benchmark comprising over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a large pre-trained encoder-decoder Transformer model for GitHub commits. The model is pre-trained with six pre-training tasks spanning three categories (i.e., denoising objectives, cross-modal generation, and contrastive learning) to learn commit fragment representations. Furthermore, we unify a ``commit intelligence'' framework with one understanding task and three generation tasks for commits. Comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task enhances the model's performance.