Code understanding is an increasingly important application of Artificial Intelligence. A fundamental aspect of understanding code is understanding text about code, e.g., documentation and forum discussions. Pre-trained language models (e.g., BERT) are a popular approach for various NLP tasks, and there are now a variety of benchmarks, such as GLUE, to help improve the development of such models for natural language understanding. However, little is known about how well such models work on textual artifacts about code, and we are unaware of any systematic set of downstream tasks for such an evaluation. In this paper, we derive a set of benchmarks (BLANCA - Benchmarks for LANguage models on Coding Artifacts) that assess code understanding based on tasks such as predicting the best answer to a question in a forum post, finding related forum posts, or predicting classes related in a hierarchy from class documentation. We evaluate the performance of current state-of-the-art language models on these tasks and show that fine-tuning yields a significant improvement on each task. We also show that multi-task training over BLANCA tasks helps build better language models for code understanding.