This paper presents medBERT.de, a pre-trained German BERT model specifically designed for the German medical domain. The model was trained on a large corpus of 4.7 million German medical documents and achieves new state-of-the-art performance on eight medical benchmarks covering a wide range of disciplines and medical document types. Beyond evaluating overall performance, we conduct a more in-depth analysis of the model's capabilities: we investigate the impact of data deduplication on performance, as well as the potential benefits of more efficient tokenization methods. Our results indicate that domain-specific models such as medBERT.de are particularly useful for longer texts, and that deduplication of the training data does not necessarily improve performance. Furthermore, we find that efficient tokenization plays only a minor role, and we attribute most of the performance gains to the large amount of training data. To encourage further research, the pre-trained model weights and new benchmarks based on radiological data are made publicly available to the scientific community.
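For readers who want to experiment with the released weights, the following is a minimal sketch of how they could be loaded via the Hugging Face transformers library. The repository identifier shown is an assumption for illustration, not confirmed by the abstract; substitute the identifier from the official release.

```python
# Minimal sketch: loading pre-trained weights with Hugging Face transformers.
# NOTE: "GerMedBERT/medbert-512" is a hypothetical hub identifier -- replace
# it with the identifier given in the official model release.
from transformers import AutoTokenizer, AutoModel

model_name = "GerMedBERT/medbert-512"  # assumption, not from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a German clinical sentence and obtain contextual embeddings.
text = "Der Patient klagt über starke Kopfschmerzen."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, num_tokens, hidden_size)
```

From here, the encoder can be fine-tuned on downstream medical tasks (e.g., classification or named-entity recognition) in the standard BERT fashion.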