This study investigates how Boolean operators behave when combining annotation output from multiple Natural Language Processing (NLP) systems across multiple corpora, and assesses how filtering by aggregation of Unified Medical Language System (UMLS) Metathesaurus concepts affects system performance for Named Entity Recognition (NER) of UMLS concepts. We used three corpora annotated for UMLS concepts: the 2010 i2b2/VA challenge set (31,161 annotations), the Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ) corpus (17,457 annotations, including UMLS concept unique identifiers), and the Fairview Health Services corpus (44,530 annotations). For UMLS concept matching, Boolean ensembling on the MiPACQ corpus trended toward higher performance than the individual systems. An approximate grid search can help optimize the precision-recall tradeoff and can provide a set of heuristics for choosing an optimal set of ensembles.
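To make the idea of Boolean ensembling concrete, the following is a minimal sketch, not the study's actual pipeline. It assumes each annotation is a tuple of (start offset, end offset, concept identifier), that matching is exact on all three fields, and that candidate ensembles are limited to a flat AND (intersection) or OR (union) over subsets of systems. The system names, offsets, and identifiers are hypothetical placeholders, and the exhaustive enumeration stands in for the approximate grid search, which the abstract does not specify.

```python
from itertools import combinations

# Hypothetical toy data: each annotation is (start_offset, end_offset, concept_id).
# The identifiers are illustrative placeholders, not drawn from any corpus.
gold = {(0, 8, "C0004057"), (15, 27, "C0020538"), (40, 48, "C0011849")}
systems = {
    "sys_a": {(0, 8, "C0004057"), (15, 27, "C0020538"), (60, 66, "C0000001")},
    "sys_b": {(0, 8, "C0004057"), (40, 48, "C0011849"), (70, 75, "C0000002")},
    "sys_c": {(15, 27, "C0020538"), (40, 48, "C0011849")},
}


def score(predicted, reference):
    """Return (precision, recall, F1) under exact-match comparison (an assumption)."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def candidate_ensembles(system_outputs):
    """Yield (label, annotations) for each single system and for the flat
    AND / OR combination of every subset of two or more systems."""
    names = sorted(system_outputs)
    for name in names:
        yield name, system_outputs[name]
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            outputs = [system_outputs[name] for name in combo]
            yield " AND ".join(combo), set.intersection(*outputs)
            yield " OR ".join(combo), set.union(*outputs)


def rank_ensembles(system_outputs, reference):
    """Score every candidate and sort by F1, best first."""
    results = [
        (label, *score(annotations, reference))
        for label, annotations in candidate_ensembles(system_outputs)
    ]
    return sorted(results, key=lambda row: row[3], reverse=True)


ranked = rank_ensembles(systems, gold)
for label, precision, recall, f1 in ranked:
    print(f"{label:28s} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy setting, OR combinations raise recall at the cost of precision and AND combinations do the reverse, which is the tradeoff a search over ensembles is meant to balance. Nested expressions such as (A AND B) OR C, and partial or overlap-based span matching, are left out to keep the sketch short.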