Background: In information extraction and natural language processing, accessible datasets are crucial for reproducing and comparing results. Publicly available implementations and tools can serve as benchmarks and facilitate the development of more complex applications. However, in clinical text processing accessible datasets are scarce, and so are existing tools. One of the main reasons is the sensitivity of the data. This problem is even more evident for non-English languages. Approach: To address this situation, we introduce a workbench: a collection of German clinical text processing models. The models are trained on a de-identified corpus of German nephrology reports. Result: The presented models provide promising results on in-domain data. Moreover, we show that our models can also be successfully applied to other biomedical text in German. Our workbench is made publicly available so that it can be used out of the box, as a benchmark, or transferred to related problems.