Despite recent advances in clinical natural language processing (NLP), the clinical and translational research community remains reluctant to adopt NLP models because of their limited transparency, interpretability, and usability. In this study, we propose an open NLP development framework and evaluate it by implementing NLP algorithms for the National COVID Cohort Collaborative (N3C). Motivated by the interest in extracting information from COVID-19 related clinical notes, our work includes 1) an open data annotation process using COVID-19 signs and symptoms as the use case, 2) a community-driven ruleset composing platform, and 3) a synthetic text data generation workflow that produces texts for information extraction tasks without involving human subjects. The corpora were derived from texts from three institutions (Mayo Clinic, University of Kentucky, University of Minnesota). A single institution's (Mayo) ruleset was evaluated against the gold standard annotations, yielding F-scores of 0.876, 0.706, and 0.694 on the Mayo, Minnesota, and Kentucky test datasets, respectively. As a consortium effort of the N3C NLP subgroup, the study demonstrates the feasibility of creating a federated NLP algorithm development and benchmarking platform to support multi-institution clinical NLP studies and adoption. Although we use COVID-19 as the use case here, the framework is general enough to be applied to other domains of interest in clinical NLP.
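To make the evaluation step concrete, the following is a minimal sketch of how a rule-based extractor might be scored against gold standard annotations with an F-score. It is an illustration under stated assumptions, not the N3C ruleset or platform: the `RULES` patterns, the two example notes, the gold labels, and the helper names `extract` and `f_score` are all hypothetical, and scoring is done at the level of (note, concept) pairs rather than text spans.

```python
import re

# Hypothetical ruleset (illustrative only): symptom concept -> regex pattern.
RULES = {
    "fever": re.compile(r"\b(fever|febrile)\b", re.IGNORECASE),
    "cough": re.compile(r"\bcough(ing)?\b", re.IGNORECASE),
    "dyspnea": re.compile(r"\b(dyspnea|shortness of breath)\b", re.IGNORECASE),
}


def extract(text):
    """Return the set of concepts whose pattern matches anywhere in the text."""
    return {concept for concept, pattern in RULES.items() if pattern.search(text)}


def f_score(predicted, gold):
    """Micro-averaged F1 over sets of (note_id, concept) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up notes and gold labels, used only to exercise the functions.
notes = {
    "n1": "Patient reports fever and dry cough.",
    "n2": "No shortness of breath; mild fatigue.",
}
gold = {("n1", "fever"), ("n1", "cough"), ("n2", "fatigue")}

predicted = {(note_id, c) for note_id, text in notes.items() for c in extract(text)}
score = f_score(predicted, gold)
print(f"F-score: {score:.3f}")
```

In this toy setup the naive rules flag the negated "shortness of breath" as a finding and have no rule for "fatigue", so both precision and recall fall below 1. Gaps of this kind are one plausible reason a ruleset tuned on one institution's notes could score lower on another's.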