MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain

Keno K. Bressem, Jens-Michalis Papaioannou, Paul Grundmann, Florian Borchert, Lisa C. Adams, Leonhard Liu, Felix Busch, Lina Xu, Jan P. Loyen, Stefan M. Niehues, Moritz Augustin, Lennart Grosser, Marcus R. Makowski, Hugo JWL. Aerts, Alexander Löser

Source: arXiv. Keno K. Bressem, Jens-Michalis Papaioannou, and Paul Grundmann contributed equally.

This paper presents medBERT.de, a pre-trained German BERT model specifically designed for the German medical domain. The model has been trained on a large corpus of 4.7 million German medical documents and has been shown to achieve new state-of-the-art performance on eight different medical benchmarks covering a wide range of disciplines and medical document types. In addition to evaluating the overall performance of the model, this paper also conducts a more in-depth analysis of its capabilities. We investigate the impact of data deduplication on the model's performance, as well as the potential benefits of using more efficient tokenization methods. Our results indicate that domain-specific models such as medBERT.de are particularly useful for longer texts, and that deduplication of training data does not necessarily lead to improved performance. Furthermore, we found that efficient tokenization plays only a minor role in improving model performance, and attribute most of the improved performance to the large amount of training data. To encourage further research, the pre-trained model weights and new benchmarks based on radiological data are made publicly available for use by the scientific community.
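
The abstract does not say how the training data was deduplicated, so the following is only a minimal sketch of one common approach: dropping exact duplicates by hashing whitespace- and case-normalized text. The function names `normalize` and `deduplicate` and the sample sentences are hypothetical and are not taken from the paper.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized text.

    Assumption: exact-match deduplication only. Near-duplicate detection
    (for example MinHash) is a different technique and is not shown here.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical sample corpus: the second entry differs only in spacing.
corpus = [
    "Kein Nachweis einer Fraktur.",
    "Kein  Nachweis einer   Fraktur.",
    "Unauffälliger Befund.",
]
deduped = deduplicate(corpus)
```

Here `deduped` keeps the first and third entries. Comparing a model trained on `corpus` with one trained on `deduped` is the kind of ablation the abstract describes, although the paper's actual procedure may differ.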

