This paper explores models based on the extreme gradient boosting (XGBoost) approach for business risk classification. Feature selection (FS) algorithms and hyper-parameter optimization are considered simultaneously during model training. Five of the most commonly used FS methods (weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information) are applied to alleviate the effect of redundant features. Two hyper-parameter optimization approaches, random search (RS) and the Bayesian tree-structured Parzen estimator (TPE), are applied to XGBoost. The effects of the different FS and hyper-parameter optimization methods on model performance are assessed with the Wilcoxon signed-rank test. The performance of XGBoost is compared with that of the traditionally used logistic regression (LR) model in terms of classification accuracy, area under the curve (AUC), recall, and F1 score obtained from 10-fold cross-validation. The results show that hierarchical clustering is the optimal FS method for LR, while weight by Chi-square achieves the best performance in XGBoost. XGBoost with either TPE or RS optimization significantly outperforms LR. TPE is superior to RS, yielding a significantly higher accuracy and marginally higher AUC, recall, and F1 score. Furthermore, XGBoost with TPE tuning shows lower variability than XGBoost with RS tuning. Finally, the ranking of feature importance produced by XGBoost enhances model interpretation. XGBoost with Bayesian TPE hyper-parameter optimization therefore serves as an operative yet powerful approach for business risk modeling.
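The paired comparison described in the abstract (two tuning or FS methods scored on the same 10 cross-validation folds) can be illustrated with a minimal sketch of the Wilcoxon signed-rank test. This is not the authors' implementation. The function names `average_ranks` and `wilcoxon_signed_rank` are hypothetical, the per-fold accuracies in the usage section are invented purely for illustration, and the sketch assumes the common conventions of dropping zero differences and computing an exact two-sided p-value by enumerating all sign patterns, which is feasible only for small samples such as 10 folds.

```python
from itertools import product


def average_ranks(values):
    """Rank values in ascending order (1-based); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(a, b):
    """Return (W_plus, exact two-sided p-value) for paired samples a and b."""
    # Round to suppress floating-point noise; zero differences are dropped (assumption).
    diffs = [round(x - y, 6) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under the null hypothesis each rank carries a + or - sign with equal
    # probability, so enumerate all 2^n sign patterns to get the exact p-value.
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    hits = total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        if abs(w - center) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / total


if __name__ == "__main__":
    # Hypothetical per-fold accuracies, NOT results from the paper.
    acc_tpe = [0.81, 0.83, 0.80, 0.84, 0.82, 0.85, 0.81, 0.83, 0.82, 0.84]
    acc_rs = [0.79, 0.82, 0.80, 0.81, 0.80, 0.83, 0.82, 0.80, 0.81, 0.82]
    w_plus, p_value = wilcoxon_signed_rank(acc_tpe, acc_rs)
    print(f"W+ = {w_plus}, exact two-sided p = {p_value:.4f}")
```

A signed-rank test suits this setting because the fold scores of the two methods are paired and need not be normally distributed. For larger samples, a library routine with a normal approximation would be the more practical choice.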