Pre-trained language models (PLMs) have been deployed in many natural language processing (NLP) tasks and in various domains. Researchers have reported beneficial results from pre-training language models on rich general- or mixed-domain data and then fine-tuning them with small amounts of available data from a low-resource domain. In this work, we question this assumption and investigate whether BERT-based PLMs from the biomedical domain can perform well on clinical text mining tasks via fine-tuning. We test a state-of-the-art model, Bioformer, which is pre-trained on a large amount of biomedical data from the PubMed corpus. We fine-tune its task-adapted version (BioformerApt) on a historical n2c2 clinical NLP challenge dataset and show that its performance is actually very low. We also present our own end-to-end model, TransformerCRF, which combines a Transformer encoder with a conditional random field (CRF) decoder. We further create a new variant, BioformerCRF, by adding a CRF layer on top of the PLM Bioformer. We investigate the performance of TransformerCRF, trained from scratch on a limited amount of data, as well as that of BioformerCRF, on clinical text mining tasks. Experimental evaluation shows that, in a \textit{constrained setting}, all tested models are \textit{far from ideal} at recognising extremely low-frequency special tokens, even though they achieve relatively high accuracy on overall text tagging. Our models and source code will be hosted at \url{https://github.com/poethan/TransformerCRF}.
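The following is a minimal sketch of the TransformerCRF design: a Transformer encoder produces per-token emission scores and a CRF layer scores and decodes the tag sequence. The class name, hyperparameters, and the use of the third-party \texttt{pytorch-crf} package are illustrative placeholders rather than the exact released implementation; see the repository URL above for the actual code.

\begin{verbatim}
# Minimal sketch of a TransformerCRF-style tagger (illustrative, not the
# released implementation). Requires: pip install torch pytorch-crf
import torch
import torch.nn as nn
from torchcrf import CRF


class TransformerCRFTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.emissions = nn.Linear(d_model, num_tags)  # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)     # structured output layer

    def forward(self, input_ids, tags=None, mask=None):
        pad_mask = ~mask if mask is not None else None
        hidden = self.encoder(self.embed(input_ids),
                              src_key_padding_mask=pad_mask)
        scores = self.emissions(hidden)
        if tags is not None:
            # training: negative log-likelihood of gold tags under the CRF
            return -self.crf(scores, tags, mask=mask, reduction='mean')
        # inference: Viterbi-decoded best tag sequence per sentence
        return self.crf.decode(scores, mask=mask)


# Toy usage with random data (hypothetical vocabulary and tag-set sizes).
model = TransformerCRFTagger(vocab_size=1000, num_tags=9)
x = torch.randint(1, 1000, (2, 12))
y = torch.randint(0, 9, (2, 12))
m = torch.ones(2, 12, dtype=torch.bool)
loss = model(x, tags=y, mask=m)   # scalar training loss
pred = model(x, mask=m)           # list of predicted tag sequences
\end{verbatim}

The BioformerCRF variant follows the same pattern, except that the randomly initialised Transformer encoder is replaced by the pre-trained Bioformer encoder, whose hidden states feed the emission layer and CRF.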