Methodology for Comparing Machine Learning Algorithms for Survival Analysis

Lucas Buk Cardoso, Simone Aldrey Angelo, Yasmin Pacheco Gil Bonilha, Fernando Maia, Adeylson Guimarães Ribeiro, Maria Paula Curado, Gisele Aparecida Fernandes, Vanderlei Cunha Parro, Flávio Almeida de Magalhães Cipparrone, Alexandre Dias Porto Chiavegatto Filho, Tatiana Natasha Toporcov

This study presents a comparative methodological analysis of six machine learning models for survival analysis (MLSA). Using data from nearly 45,000 colorectal cancer patients in the Hospital-Based Cancer Registries of São Paulo, we evaluated six models that predict survival while accounting for censored data: Random Survival Forest (RSF), Gradient Boosting for Survival Analysis (GBSA), Survival SVM (SSVM), XGBoost-Cox (XGB-Cox), XGBoost-AFT (XGB-AFT), and LightGBM (LGBM). Hyperparameter optimization was performed with different samplers, and model performance was assessed using the Concordance Index (C-Index), the inverse-probability-of-censoring-weighted C-Index (C-Index IPCW), time-dependent AUC, and the Integrated Brier Score (IBS). Survival curves produced by the models were compared with predictions from classification algorithms, and predictors were interpreted using SHAP and permutation importance. XGB-AFT achieved the best performance (C-Index = 0.7618; C-Index IPCW = 0.7532), followed by GBSA and RSF. The results highlight the potential and applicability of MLSA to improve survival prediction and support decision making.
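
As an illustration of the headline metric, the following is a minimal sketch of Harrell's Concordance Index for right-censored data. It is not the authors' implementation, and the abstract does not say which software was used. The function name harrell_c_index and the toy data are hypothetical. Pairs with tied observed times are simply skipped, which is a simplifying assumption.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's C-Index for right-censored data (illustrative sketch).

    times: observed follow-up times.
    events: True if the event was observed, False if censored.
    risk_scores: model output, where a higher score means higher risk
        (shorter expected survival).
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let a be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        # Assumption: pairs with tied times are skipped for simplicity.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk failed earlier: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: five patients, two of them censored.
toy_times = [5, 8, 12, 3, 10]
toy_events = [True, False, True, True, False]
toy_risk = [0.9, 0.4, 0.2, 0.8, 0.5]
toy_c = harrell_c_index(toy_times, toy_events, toy_risk)
print(round(toy_c, 4))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The IPCW variant reported alongside it additionally reweights comparable pairs by the inverse probability of remaining uncensored, which this sketch does not do.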
