Transformers are the current state of the art in natural language processing in many domains and are gaining traction within software engineering research as well. Such models are pre-trained on large amounts of data, usually from the general domain. However, we only have a limited understanding of the validity of transformers within the software engineering domain, i.e., how well such models understand words and sentences within a software engineering context and how this improves the state of the art. In this article, we shed light on this complex, but crucial, issue. We compare BERT transformer models trained with software engineering data against transformers based on general domain data along multiple dimensions: their vocabulary, their ability to predict missing words, and their performance in classification tasks. Our results show that for tasks that require an understanding of the software engineering context, pre-training with software engineering data is valuable, while general domain models are sufficient for general language understanding, even within the software engineering domain.