Transformers are the current state of the art in natural language processing in many domains and are gaining traction within software engineering research as well. Such models are pre-trained on large amounts of data, usually from the general domain. However, we only have a limited understanding of the validity of transformers within the software engineering domain, i.e., how well such models understand words and sentences within a software engineering context and how this improves the state of the art. In this article, we shed light on this complex, but crucial, issue. We compare BERT transformer models trained with software engineering data against transformers based on general domain data along multiple dimensions: their vocabulary, their ability to predict missing words, and their performance in classification tasks. Our results show that for tasks that require an understanding of the software engineering context, pre-training with software engineering data is valuable, while general domain models are sufficient for general language understanding, even within the software engineering domain.