Pretrained language models have served as important backbones for natural language processing. Recently, in-domain pretraining has been shown to benefit various domain-specific downstream tasks. In the biomedical domain, natural language generation (NLG) tasks are of critical importance but remain understudied. Approaching natural language understanding (NLU) tasks as NLG achieves satisfactory performance in the general domain through constrained language generation or language prompting. We emphasize that the lack of in-domain generative language models and of systematic generative downstream benchmarks in the biomedical domain hinders the development of the research community. In this work, we introduce the generative language model BioBART, which adapts BART to the biomedical domain. We collate various biomedical language generation tasks including dialogue, summarization, entity linking, and named entity recognition. BioBART, pretrained on PubMed abstracts, achieves improved performance compared to BART and sets strong baselines on several tasks. Furthermore, we conduct ablation studies on the pretraining tasks for BioBART and find that sentence permutation has negative effects on downstream tasks.