Interpretable Machine Learning for COVID-19: An Empirical Study on Severity Prediction Task

The black-box nature of machine learning models hinders the deployment of some high-accuracy models in medical diagnosis. It is risky to put one's life in the hands of models that medical researchers do not fully understand. However, through model interpretation, black-box models can promptly reveal significant biomarkers that medical practitioners may have overlooked due to the surge of infected patients during the COVID-19 pandemic. This research leverages a database of 92 patients with laboratory-confirmed SARS-CoV-2 tests between 18 January 2020 and 5 March 2020 in Zhuhai, China, to identify biomarkers indicative of infection severity. We interpret four machine learning models (decision tree, random forest, gradient boosted trees, and neural network) using permutation feature importance, Partial Dependence Plots (PDP), Individual Conditional Expectation (ICE), Accumulated Local Effects (ALE), Local Interpretable Model-agnostic Explanations (LIME), and SHapley Additive exPlanations (SHAP). We find that an increase in N-terminal pro-brain natriuretic peptide (NTproBNP), C-reactive protein (CRP), and lactic dehydrogenase (LDH), together with a decrease in lymphocytes (LYM), is associated with severe infection and an increased risk of death, which is consistent with recent medical research on COVID-19 and with other research using dedicated models. We further validate our methods on a large open dataset of 5644 confirmed patients from the Hospital Israelita Albert Einstein in São Paulo, Brazil, obtained from Kaggle, and identify leukocytes, eosinophils, and platelets as three indicative biomarkers for COVID-19.
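As a rough illustration of the first interpretation method named in the abstract, permutation feature importance, the sketch below fits a random forest and ranks features by how much shuffling each one degrades held-out accuracy. It assumes scikit-learn is available. The data is synthetic (generated with `make_classification`), and the column names, sample size, and hyperparameters are hypothetical choices made for this example; none of it is the study's data, code, or configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical column labels for SYNTHETIC data (not the study's biomarkers).
# With shuffle=False, make_classification puts the informative columns first.
FEATURE_NAMES = ["informative_0", "informative_1", "noise_0", "noise_1"]


def rank_features(seed=0):
    """Return (feature name, mean importance) pairs, most important first."""
    X, y = make_classification(
        n_samples=400,
        n_features=4,
        n_informative=2,
        n_redundant=0,
        shuffle=False,
        random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)

    # Shuffle one column at a time on held-out data and record the mean
    # drop in accuracy over n_repeats shuffles.
    result = permutation_importance(
        model, X_test, y_test, n_repeats=20, random_state=seed
    )
    order = result.importances_mean.argsort()[::-1]
    return [(FEATURE_NAMES[i], float(result.importances_mean[i])) for i in order]


for name, score in rank_features():
    print(f"{name}: {score:.3f}")
```

Because the method only needs predictions from a fitted model, the same call would work unchanged for the other model families the abstract lists; computing importance on held-out rather than training data avoids rewarding features the model has merely memorized.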
