Natural language processing (NLP) tasks (text classification, named entity recognition, etc.) have seen revolutionary improvements over the last few years. This is due to language models such as BERT that achieve deep knowledge transfer by using a large pre-trained model, then fine-tuning the model on specific tasks. The BERT architecture has shown even better performance on domain-specific tasks when the model is pre-trained using domain-relevant texts. Inspired by these recent advancements, we have developed NukeLM, a nuclear-domain language model pre-trained on 1.5 million abstracts from the U.S. Department of Energy Office of Scientific and Technical Information (OSTI) database. This NukeLM model is then fine-tuned for the classification of research articles into either binary classes (related to the nuclear fuel cycle [NFC] or not) or multiple categories related to the subject of the article. We show that continued pre-training of a BERT-style architecture prior to fine-tuning yields greater performance on both article classification tasks. This information is critical for properly triaging manuscripts, a necessary task for better understanding the citation networks of researchers publishing in the nuclear space, and for uncovering new areas of research in the nuclear (or nuclear-relevant) domains.
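To make the described workflow concrete, the sketch below illustrates the fine-tuning step with the Hugging Face transformers library. It is a minimal sketch, not the authors' implementation: the checkpoint name "bert-base-uncased" stands in for a domain-adapted NukeLM checkpoint, and the abstracts, labels, and training settings are placeholder assumptions.

```python
# Minimal sketch: fine-tune a BERT-style model for binary NFC-related
# classification of article abstracts. Checkpoint, data, and hyperparameters
# are illustrative placeholders, not the NukeLM configuration.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # placeholder for a domain pre-trained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

class AbstractDataset(Dataset):
    """Wraps abstract texts and binary NFC-related labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy examples; real training would use labeled OSTI abstracts.
train_ds = AbstractDataset(
    ["Uranium enrichment cascade modeling ...",
     "Deep learning for protein structure prediction ..."],
    [1, 0],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nfc-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_ds,
)
trainer.train()
```

The multi-category task described in the abstract would follow the same pattern, with `num_labels` set to the number of subject categories.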