Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts often requires custom-designed datasets to address the NLP task at hand in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several minor and major, interconnected problems, such as a lack of task-matching datasets as well as task-specific pre-trained models. In our work, we propose leveraging pre-trained language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset which we use to train a medical NER model for German texts, GPTNERMED, yet our method remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at: https://github.com/frankkramer-lab/GPTNERMED