Transformer-based pretrained language models (PLMs) have started a new era in modern natural language processing (NLP). These models combine the power of transformers, transfer learning, and self-supervised learning (SSL). Following the success of these models in the general domain, the biomedical research community has developed various in-domain PLMs, ranging from BioBERT to the latest BioMegatron and CoderBERT models. We strongly believe there is a need for a comprehensive survey of the various transformer-based biomedical pretrained language models (BPLMs). In this survey, we start with a brief overview of foundational concepts such as self-supervised learning, the embedding layer, and transformer encoder layers. We then discuss the core concepts of transformer-based PLMs, such as pretraining methods, pretraining tasks, fine-tuning methods, and the various embedding types specific to the biomedical domain. We introduce a taxonomy for transformer-based BPLMs and then discuss each of the models. We discuss various challenges and present possible solutions. We conclude by highlighting some of the open issues that will drive the research community to further improve transformer-based BPLMs.