Pre-trained language models have become the de facto paradigm for most natural language processing (NLP) tasks. The biomedical domain also benefits from NLP techniques, and various pre-trained language models have been proposed for it by leveraging domain-specific datasets, including biomedical literature, biomedical social media, electronic health records, and other biological sequences. Substantial effort has been devoted to applying these biomedical pre-trained language models to downstream biomedical tasks across the informatics, medicine, and computer science (CS) communities. However, the vast majority of existing works appear isolated from one another, probably because of the cross-disciplinary nature of the field. A survey is therefore needed that not only systematically reviews recent advances in biomedical pre-trained language models and their applications but also standardizes terminology, taxonomy, and benchmarks. To this end, this paper summarizes the recent progress of pre-trained language models in the biomedical domain. In particular, we exhaustively discuss an overview and taxonomy of existing biomedical pre-trained language models as well as their applications in downstream biomedical tasks. Finally, we illustrate various limitations and future trends, which we hope can provide inspiration for future research.