Automatically classifying electronic health records (EHRs) into diagnostic codes has been a long-standing challenge for the NLP community. State-of-the-art methods treat this task as a multi-label classification problem and propose various architectures to model it. However, these systems do not leverage pretrained language models, which have achieved strong performance on natural language understanding tasks. Prior work has shown that pretrained language models underperform on this task under the standard fine-tuning scheme. This paper therefore analyzes the causes of this underperformance and develops a framework for automatic ICD coding with pretrained language models. Through experiments, we identify three main issues: 1) the large label space, 2) long input sequences, and 3) the domain mismatch between pretraining and fine-tuning. We propose PLM-ICD, a framework that tackles these challenges with dedicated strategies. Experimental results show that the proposed framework overcomes these challenges and achieves state-of-the-art performance on multiple metrics on the benchmark MIMIC data. The source code is available at https://github.com/MiuLab/PLM-ICD
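The following is a minimal sketch, not the authors' released implementation, of two of the strategies the abstract alludes to: segmenting long clinical notes into chunks that fit a PLM's context window, and label-wise attention over all token representations to cope with the large ICD label space. The model name, chunk length, and label count are placeholder assumptions for illustration only.

```python
# Sketch of chunked PLM encoding + label-wise attention for multi-label ICD coding.
# All sizes and the checkpoint name below are assumptions, not the paper's settings.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"  # assumed domain-adapted PLM
CHUNK_LEN, NUM_LABELS = 128, 50                 # placeholder sizes

class ChunkedLabelAttentionCoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        hidden = self.encoder.config.hidden_size
        # One learned query vector per ICD code for label-wise attention.
        self.label_queries = nn.Parameter(torch.randn(NUM_LABELS, hidden))
        self.classifier = nn.Linear(hidden, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        # input_ids / attention_mask: (num_chunks, CHUNK_LEN) for one document.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Flatten chunk outputs back into one long token sequence.
        tokens = out.last_hidden_state.reshape(-1, out.last_hidden_state.size(-1))
        mask = attention_mask.reshape(-1).bool()
        tokens = tokens[mask]                                          # (T, hidden)
        # Label-wise attention: each label attends over the whole document.
        scores = torch.softmax(self.label_queries @ tokens.T, dim=-1)  # (L, T)
        label_repr = scores @ tokens                                   # (L, hidden)
        # Per-label logit from the matching row of the classifier weights.
        logits = (label_repr * self.classifier.weight).sum(-1) + self.classifier.bias
        return logits  # train with sigmoid + binary cross-entropy (multi-label)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
note = "Patient admitted with acute exacerbation of chronic systolic heart failure ..."
enc = tokenizer(note, truncation=True, max_length=CHUNK_LEN,
                return_overflowing_tokens=True, padding="max_length",
                return_tensors="pt")
model = ChunkedLabelAttentionCoder()
with torch.no_grad():
    print(model(enc["input_ids"], enc["attention_mask"]).shape)  # torch.Size([50])
```

Chunking keeps every token of a long discharge summary visible to the classifier, while the label-wise attention lets each code focus on the spans most relevant to it, which is what makes the large label space tractable in this kind of setup.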