Identifying spans in medical texts that correspond to medical entities is a core step in many healthcare NLP tasks, such as ICD coding, medical finding extraction, and medical note contextualization. Existing entity extraction methods rely on a fixed, limited vocabulary of medical entities and struggle to extract entities represented by disjoint spans. In this paper, we present a new transformer-based architecture, OSLAT (Open Set Label Attention Transformer), that addresses many of these limitations. Our approach uses a label-attention mechanism to implicitly learn the spans associated with entities of interest. These entities can be provided as free text, including entities not seen during OSLAT's training, and the model can extract spans even when they are disjoint. To test the generalizability of our method, we train two separate models on two datasets with very low entity overlap: (1) a public discharge-notes dataset from hNLP, and (2) a much more challenging proprietary patient-text dataset, "Reasons for Encounter" (RFE). We find that OSLAT models trained on either dataset outperform rule-based and fuzzy string matching baselines when applied to the RFE dataset as well as to the portion of the hNLP dataset where entities are represented by disjoint spans. Our code can be found at https://github.com/curai/curai-research/tree/main/OSLAT.
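To illustrate the core idea behind label attention, the sketch below shows a minimal, self-contained version of the mechanism: a free-text entity label is embedded, and its scaled dot-product attention over contextual token embeddings yields per-token relevance weights, from which (possibly disjoint) entity spans can be read off. This is a simplified toy illustration under assumed shapes, not OSLAT's actual architecture; the function name and toy embeddings are hypothetical.

```python
import numpy as np

def label_attention_scores(token_embs, label_emb):
    """Score each token's relevance to a free-text entity label.

    token_embs: (T, d) contextual token embeddings for the input text
    label_emb:  (d,)   pooled embedding of the entity label text
    Returns a (T,) softmax distribution over tokens; high-weight tokens
    indicate the (possibly disjoint) span associated with the entity.
    """
    d = token_embs.shape[1]
    # Scaled dot-product attention logits between label and tokens
    logits = token_embs @ label_emb / np.sqrt(d)
    # Numerically stable softmax over the token axis
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

# Toy example: 5 tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))
label = rng.normal(size=4)
scores = label_attention_scores(tokens, label)
```

Because attention is computed against an arbitrary label embedding rather than a fixed label index, unseen entity labels can be queried at inference time, which is what makes the approach open-set.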