Many prior language modeling efforts have shown that pre-training on an in-domain corpus can significantly improve performance on downstream domain-specific NLP tasks. However, the difficulty of collecting enough in-domain data might discourage researchers from approaching this pre-training task. In this paper, we conducted a series of experiments pre-training Bidirectional Encoder Representations from Transformers (BERT) on biomedical corpora of different sizes. The results demonstrate that pre-training on a relatively small amount of in-domain data (4GB) with a limited number of training steps can lead to better performance on downstream domain-specific NLP tasks than fine-tuning models pre-trained on general corpora.
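For illustration, the snippet below is a minimal sketch of the kind of continued masked-language-model pre-training on an in-domain corpus described above, written against the Hugging Face Transformers and Datasets APIs. The checkpoint name, corpus path (`biomed_corpus.txt`), and all hyperparameters are illustrative placeholders and do not reflect the paper's actual training configuration.

```python
# Minimal sketch: continued (domain-adaptive) MLM pre-training of BERT on an
# in-domain text corpus. Assumes the Hugging Face Transformers and Datasets
# libraries; the file path and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from a general-domain checkpoint and continue pre-training on domain text.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One document per line; in the paper's setting this would be ~4GB of biomedical text.
raw = load_dataset("text", data_files={"train": "biomed_corpus.txt"})["train"]
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Dynamic masking for the MLM objective (15% of tokens masked).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-biomed-continued",
    max_steps=100_000,                 # "limited training steps" in the abstract's sense
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

The resulting checkpoint would then be fine-tuned on the downstream domain-specific tasks and compared against models fine-tuned directly from general-corpus weights.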