Medical coding is the task of assigning medical codes to clinical free-text documentation. Healthcare professionals manually assign such codes to track patient diagnoses and treatments, and automated medical coding could considerably alleviate this administrative burden. In this paper, we reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models. We show that several models underperform due to weak configurations, poorly sampled train-test splits, and insufficient evaluation. In previous work, the macro F1 score was computed sub-optimally, and our correction doubles it. We contribute a revised model comparison based on stratified sampling and identical experimental setups, including hyperparameter and decision-boundary tuning. We analyze prediction errors to validate and falsify assumptions made in previous work. The analysis confirms that all models struggle with rare codes, while long documents have only a negligible impact. Finally, we present the first comprehensive results on the newly released MIMIC-IV dataset using the reproduced models. We release our code, model parameters, and new MIMIC-III and MIMIC-IV training and evaluation pipelines to facilitate fair future comparisons.
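To make the macro F1 claim concrete, the following is a minimal sketch of how averaging per-code F1 over the full label space can deflate the score when many codes never occur in the test set. The toy matrices and the assumption that the correction amounts to averaging only over codes present in the test split are illustrative, not the paper's code.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical binary indicator matrices: rows = documents, columns = codes.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]])  # code 2 never occurs
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Naive macro F1: averages over every column, so codes absent from the
# test set contribute an F1 of 0 and drag the mean down.
naive = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Corrected macro F1: average only over codes that occur in the test set.
present = y_true.sum(axis=0) > 0
corrected = f1_score(y_true[:, present], y_pred[:, present],
                     average="macro", zero_division=0)

print(f"naive: {naive:.3f}, corrected: {corrected:.3f}")
# naive: 0.556, corrected: 0.833 -- the gap grows with more absent codes.
```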
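The decision-boundary tuning mentioned above can likewise be illustrated. A hedged sketch, assuming it means sweeping a single probability threshold on validation data and keeping the one that maximizes micro F1; the function name, grid, and metric choice are hypothetical, not taken from the released pipelines.

```python
import numpy as np

def tune_threshold(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the decision boundary that maximizes micro F1 on validation data.

    probs: (docs, codes) float array of model probabilities.
    labels: (docs, codes) binary ground-truth array.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = probs >= t
        tp = np.logical_and(preds, labels.astype(bool)).sum()
        fp = np.logical_and(preds, ~labels.astype(bool)).sum()
        fn = np.logical_and(~preds, labels.astype(bool)).sum()
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)  # micro F1 at threshold t
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Usage: tune on validation predictions, then apply the threshold at test time.
# t = tune_threshold(val_probs, val_labels); test_preds = test_probs >= t
```

Tuning the threshold per model, rather than fixing it at 0.5, removes one source of unfair comparison between models whose output probabilities are calibrated differently.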