Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note that averages 3,000+ tokens. This task is challenging because of the high-dimensional space of multi-label assignment (155,000+ ICD code candidates) and the long-tail challenge: many ICD codes are rarely assigned, yet these infrequent codes are clinically important. This study addresses the long-tail challenge by transforming the multi-label classification task into an autoregressive generation task. Specifically, we first introduce a novel pretraining objective that generates free-text diagnoses and procedures using the SOAP structure, the medical logic physicians use for note documentation. Second, instead of directly predicting in the high-dimensional space of ICD codes, our model generates lower-dimensional text descriptions, from which ICD codes are then inferred. Third, we design a novel prompt template for multi-label classification. We evaluate our Generation with Prompt model on the benchmark for all code assignment (MIMIC-III-full) and on the few-shot ICD code assignment benchmark (MIMIC-III-few). Experiments on MIMIC-III-few show that our model achieves a macro F1 of 30.2, substantially outperforming both the previous MIMIC-III-full SOTA model (macro F1 of 4.3) and the model specifically designed for the few/zero-shot setting (macro F1 of 18.7). Finally, we design a novel ensemble learner, a cross-attention reranker with prompts, to integrate the predictions of the previous SOTA model and our best few-shot coding model. Experiments on MIMIC-III-full show that our ensemble learner substantially improves both macro and micro F1, from 10.4 to 14.6 and from 58.2 to 59.1, respectively.
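The macro and micro F1 scores reported above respond very differently to rare codes, which is why macro F1 is the natural measure of long-tail performance. The following is a minimal sketch under stated assumptions, not the benchmark's evaluation script: it uses one common definition of macro F1 (the unweighted mean of per-code F1), whereas some evaluation scripts instead combine macro-averaged precision and recall. The function name `micro_macro_f1` and the toy codes "A", "B", and "C" are hypothetical.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """gold, pred: lists of sets of codes, one set per note (hypothetical layout)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for code in g & p:
            tp[code] += 1  # correctly predicted code
        for code in p - g:
            fp[code] += 1  # predicted but not in the gold set
        for code in g - p:
            fn[code] += 1  # gold code that was missed
    labels = set(tp) | set(fp) | set(fn)
    # Micro F1 pools counts over all codes, so frequent codes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro F1 averages per-code F1, so each rare code weighs as much as a common one.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels) if labels else 0.0
    return micro, macro


# Toy data: "A" is frequent and always predicted; "B" and "C" are rare and always missed.
gold = [{"A", "B"}, {"A"}, {"A"}, {"A", "C"}]
pred = [{"A"}, {"A"}, {"A"}, {"A"}]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the micro score stays high while the macro score is pulled down by the two missed rare codes, the same kind of gap seen between the micro and macro F1 figures on MIMIC-III-full.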