Healthcare providers usually record detailed notes of the clinical care delivered to each patient for clinical, research, and billing purposes. Due to the unstructured nature of these narratives, providers employ dedicated staff to assign diagnostic codes to patients' diagnoses using the International Classification of Diseases (ICD) coding system. This manual process is time-consuming, costly, and error-prone. Prior work has demonstrated the potential utility of machine learning (ML) methods in automating this process, but it has relied on large quantities of manually labeled data to train the models. Additionally, diagnostic coding systems evolve with time, which makes traditional supervised learning strategies unable to generalize beyond local applications. In this work, we introduce a general weakly supervised text classification framework that learns from class-label descriptions only, without the need to use any human-labeled documents. It leverages the linguistic domain knowledge stored within pre-trained language models and the data programming framework to assign code labels to individual texts. We demonstrate the efficacy and flexibility of our method by comparing it to state-of-the-art weakly supervised text classifiers across four real-world text classification datasets, in addition to assigning ICD codes to medical notes in the publicly available MIMIC-III database.
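To make the idea of learning from class-label descriptions concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. Each word in a label description becomes a labeling function that either votes for its label or abstains, and the votes are combined per document. Two simplifications are assumed here: plain keyword matching stands in for the knowledge a pre-trained language model would supply, and an unweighted majority vote stands in for the label model that data programming would learn. The codes, descriptions, and all function names are illustrative.

```python
import re
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Illustrative label descriptions (assumed for this sketch, not from the paper).
LABEL_DESCRIPTIONS = {
    "I10": "essential primary hypertension high blood pressure",
    "E11": "type 2 diabetes mellitus",
    "J18": "pneumonia unspecified organism",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def make_labeling_function(label, keyword):
    """Return a function that votes for `label` if `keyword` is in the document."""
    def labeling_function(document_tokens):
        return label if keyword in document_tokens else ABSTAIN
    return labeling_function


def build_labeling_functions(descriptions):
    """Create one keyword labeling function per distinct word of each description."""
    functions = []
    for label, description in descriptions.items():
        for keyword in sorted(set(tokenize(description))):
            functions.append(make_labeling_function(label, keyword))
    return functions


def weak_label(document, labeling_functions):
    """Combine labeling-function votes by unweighted majority; abstain if no votes."""
    tokens = set(tokenize(document))
    votes = Counter()
    for labeling_function in labeling_functions:
        vote = labeling_function(tokens)
        if vote is not ABSTAIN:
            votes[vote] += 1
    if not votes:
        return ABSTAIN
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    lfs = build_labeling_functions(LABEL_DESCRIPTIONS)
    note = "Patient with long-standing hypertension and elevated blood pressure."
    print(weak_label(note, lfs))
```

No human-labeled documents are used at any point: the only supervision is the text of the label descriptions. In a fuller pipeline, the weak labels produced this way would typically serve as training targets for a downstream classifier.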