The current state of adoption of well-structured electronic health records and the integration of digital methods for storing medical patient data in structured formats can often be considered inferior to the use of traditional, unstructured, text-based patient data documentation. Data mining in the field of medical data analysis therefore often has to rely solely on processing unstructured data to retrieve relevant information. In natural language processing (NLP), statistical models have proven successful in various tasks such as part-of-speech tagging, relation extraction (RE) and named entity recognition (NER). In this work, we present GERNERMED, the first open, neural NLP model for NER tasks dedicated to detecting medical entity types in German text data. Here, we avoid the conflicting goals of protecting sensitive patient data from training data extraction and publishing the statistical model weights by training our model on a custom dataset that was translated from publicly available datasets in a foreign language by a pretrained neural machine translation model. The sample code and the statistical model are available at: https://github.com/frankkramer-lab/GERNERMED