Despite recent advances in clinical natural language processing (NLP), the clinical and translational research community has shown some resistance to adopting NLP models because of their limited transparency, interpretability, and usability. Building on our previous work, in this study we proposed an open NLP development framework and evaluated it by implementing NLP algorithms for the National COVID Cohort Collaborative (N3C). Motivated by interest in information extraction from COVID-19-related clinical notes, our work includes 1) an open data annotation process using COVID-19 signs and symptoms as the use case, 2) a community-driven ruleset composing platform, and 3) a synthetic text data generation workflow that produces texts for information extraction tasks without involving human subjects. The generated corpora, derived from texts from multiple institutions and from gold standard annotations, were tested against a single institution's rule set, yielding F1 scores of 0.876, 0.706, and 0.694, respectively. As a consortium effort of the N3C NLP subgroup, the study demonstrates the feasibility of creating a federated NLP algorithm development and benchmarking platform to enhance multi-institution clinical NLP studies.
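The evaluation summarized in the abstract, running one institution's rule set over an annotated corpus and reporting an F1 score, can be sketched roughly as follows. This is a minimal illustration and not the N3C implementation: the `RULES` dictionary, the `extract` and `f1_score` helpers, the sample note, and its gold annotations are all hypothetical, and exact span matching is an assumed scoring criterion that the abstract does not specify.

```python
import re

# Hypothetical rule set: concept name -> regex pattern.
# Illustrative only; this is not the rule set used in the study.
RULES = {
    "fever": re.compile(r"\b(fever|febrile)\b", re.IGNORECASE),
    "cough": re.compile(r"\bcough(ing)?\b", re.IGNORECASE),
    "dyspnea": re.compile(r"\b(dyspnea|shortness of breath)\b", re.IGNORECASE),
}


def extract(text):
    """Return the set of (concept, start, end) spans matched by the rules."""
    return {
        (concept, m.start(), m.end())
        for concept, pattern in RULES.items()
        for m in pattern.finditer(text)
    }


def f1_score(predicted, gold):
    """Exact-match precision, recall, and F1 between two annotation sets."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up synthetic note and gold annotations (assumed for illustration).
note = "Patient reports fever and dry cough. Also notes chills."
gold = {("fever", 16, 21), ("cough", 30, 35), ("chills", 48, 54)}

precision, recall, f1 = f1_score(extract(note), gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case the rule set has no pattern for "chills", so recall falls below precision. A gap of that kind is one plausible reason a single institution's rule set might score differently across corpora.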