

May 13, 2022

PathologyBERT -- Pre-trained Vs. A New Transformer Language Model for Pathology Domain


Thiago Santos, Amara Tariq, Susmita Das, Kavyasree Vayalpati, Geoffrey H. Smith, Hari Trivedi, Imon Banerjee

From arXiv; submitted to the American Medical Informatics Association (AMIA) 2022 Annual Symposium

Pathology text mining is a challenging task, given the variability in reporting and the constant stream of new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role in advancing 'big data' cancer research, including similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many other applications. While there is growing interest in developing language models for more specific clinical domains, no pathology-specific language space exists to support rapid data-mining development in the pathology space. In the literature, a few approaches have fine-tuned general transformer models on specialized corpora while keeping the original tokenizer, but in fields requiring specialized terminology these models often fail to perform adequately. We propose PathologyBERT, a pre-trained masked language model that was trained on 347,173 histopathology specimen reports and publicly released in the Huggingface repository. Our comprehensive experiments demonstrate that pre-training a transformer model on pathology corpora yields performance improvements on Natural Language Understanding (NLU) and breast cancer diagnosis classification when compared to nonspecific language models.
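The abstract's point about keeping the original tokenizer can be made concrete with a small sketch. The code below is not the authors' tokenizer or vocabulary. It is a minimal greedy longest-match WordPiece splitter, and both vocabularies are toy sets invented for illustration. It only shows how a general-purpose vocabulary may fragment a pathology term into several sub-word pieces, while a vocabulary built on domain text could keep the term whole.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabularies (hypothetical, for illustration only).
GENERAL_VOCAB = {"fi", "##bro", "##ade", "##no", "##ma"}
DOMAIN_VOCAB = GENERAL_VOCAB | {"fibroadenoma"}

print(wordpiece("fibroadenoma", GENERAL_VOCAB))
print(wordpiece("fibroadenoma", DOMAIN_VOCAB))
```

With the general vocabulary the term is spread over five pieces, so the model has to reassemble its meaning from fragments. With the domain vocabulary it is a single token.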

