This paper presents medBERT.de, a pre-trained German BERT model specifically designed for the German medical domain. The model was trained on a large corpus of 4.7 million German medical documents and achieves new state-of-the-art performance on eight medical benchmarks covering a wide range of disciplines and medical document types. Beyond evaluating overall performance, we conduct a more in-depth analysis of the model's capabilities: we investigate the impact of data deduplication on the model's performance, as well as the potential benefits of more efficient tokenization methods. Our results indicate that domain-specific models such as medBERT.de are particularly useful for longer texts, and that deduplicating the training data does not necessarily improve performance. Furthermore, we find that efficient tokenization plays only a minor role, and we attribute most of the performance gains to the large amount of training data. To encourage further research, the pre-trained model weights and new benchmarks based on radiological data are made publicly available to the scientific community.