Pathology text mining is a challenging task given the reporting variability and the constant emergence of new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role in advancing 'big data' cancer research, including similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many other applications. While there is growing interest in developing language models for more specific clinical domains, no pathology-specific language model exists to support rapid data-mining development in the pathology domain. In the literature, a few approaches fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology these models often fail to perform adequately. We propose PathologyBERT, a pre-trained masked language model trained on 347,173 histopathology specimen reports and publicly released in the Huggingface repository. Our comprehensive experiments demonstrate that pre-training a transformer model on a pathology corpus yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnosis Classification when compared to non-specific language models.
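Since the model is released as a masked language model on the Huggingface repository, it can be loaded with the standard `transformers` API. The snippet below is a minimal sketch of fill-mask inference; the repository identifier `tsantos/PathologyBERT` and the example sentence are assumptions and may need to be adjusted to the actual published model name.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed Hugging Face repo name; substitute the actual identifier if it differs.
model_name = "tsantos/PathologyBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Fill-mask inference: predict the masked token in a pathology-style sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("invasive ductal [MASK] of the breast"):
    print(prediction["token_str"], round(prediction["score"], 3))
```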