Abbreviations are unavoidable yet critical parts of medical text. Using abbreviations, especially in clinical patient notes, saves time and space, protects sensitive information, and helps avoid repetition. However, many abbreviations have multiple senses, and the lack of a standardized mapping system makes abbreviation disambiguation a difficult and time-consuming task. The main objective of this study is to examine the feasibility of token classification methods for medical abbreviation disambiguation. Specifically, we explore the capability of token classification methods to handle multiple unique abbreviations in a single text. We use two public datasets to compare and contrast the performance of several transformer models pre-trained on different scientific and medical corpora. Our proposed token classification approach outperforms the more commonly used text classification models on the abbreviation disambiguation task. In particular, the SciBERT model shows strong performance on both the token and text classification tasks over the two considered datasets. Furthermore, we find that the abbreviation disambiguation performance of the text classification models becomes comparable to that of token classification only when postprocessing is applied to their predictions, i.e., filtering the possible labels for an abbreviation based on the training data.
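To make the setup concrete, below is a minimal sketch (not the authors' released code) of abbreviation disambiguation framed as token classification, using the public SciBERT checkpoint from HuggingFace. The sense inventory, the example sentence, the abbreviation-spotting rule, and the training-sense dictionary used for the label-filtering postprocessing are all hypothetical placeholders; a real system would fine-tune the classification head on annotated notes first.

```python
# Sketch: abbreviation disambiguation as token classification,
# assuming each label is one candidate expansion (sense).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical sense inventory; "O" marks non-abbreviation tokens.
id2label = {0: "O", 1: "magnetic resonance", 2: "mitral regurgitation"}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    num_labels=len(id2label), id2label=id2label, label2id=label2id,
)  # untrained head here; fine-tuning is assumed before real use

text = "Patient has severe MR on echo."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, seq_len, num_labels)

# Predict a sense label for every token; only positions aligned with
# an abbreviation are of interest.
pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for tok, pid in zip(tokens, pred_ids):
    if tok.lower() == "mr":  # naive abbreviation spotting, for the sketch only
        print(tok, "->", id2label[pid.item()])

# Postprocessing described in the abstract: restrict each abbreviation's
# candidates to the senses observed for it in the training data, by
# masking out the logits of all other labels before taking the argmax.
train_senses = {"mr": {"magnetic resonance", "mitral regurgitation"}}  # hypothetical
allowed = torch.tensor([label2id[s] for s in train_senses["mr"]])
mask = torch.full_like(logits, float("-inf"))
mask[..., allowed] = 0.0
filtered_pred_ids = (logits + mask).argmax(dim=-1)[0]
```

The same masking step can be applied to a text classification model's output, which is the postprocessing that, per the abstract, brings text classification performance up to the level of token classification.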