Loan risk for small businesses has long been a complex problem worth exploring. Predicting loan risk can benefit entrepreneurship and thereby help create more jobs for society. CatBoost (Categorical Boosting) is a powerful machine learning algorithm well suited to datasets with many categorical variables, such as those used for forecasting loan risk. In this paper, we identify the important risk factors that contribute to the loan status classification problem. We then compare the performance of boosting-type algorithms (especially CatBoost) with that of other traditional yet popular algorithms. The dataset we adopt comes from the U.S. Small Business Administration (SBA) and has a very large sample size (899,164 observations and 27 features). To make the best use of the important features in the dataset, we propose a technique named "synthetic generation" that derives additional combined features through arithmetic operations, which improves the accuracy and AUC of the original CatBoost model. We obtain an accuracy of 95.84% and an AUC of 98.80%, both high compared with results reported in the existing related literature.
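As a rough illustration of what combining features through arithmetic operations can look like, the following is a minimal sketch, not the procedure used in the paper. The set of operations, the naming scheme, the zero-denominator handling, and the column names `loan_amount` and `guaranteed_amount` are all assumptions made for this example. They are not taken from the SBA dataset.

```python
from itertools import combinations
import operator

# Arithmetic operations used to combine two numeric columns (an assumed set).
OPS = {
    "plus": operator.add,
    "minus": operator.sub,
    "times": operator.mul,
}


def safe_div(a, b):
    """Divide a by b, returning 0.0 when the denominator is zero (an assumed convention)."""
    return a / b if b != 0 else 0.0


def synthesize_features(columns):
    """Return the original columns plus pairwise arithmetic combinations.

    `columns` maps a column name to a list of numeric values of equal length.
    """
    out = dict(columns)
    # Sort the names so the generated feature names are deterministic.
    for left, right in combinations(sorted(columns), 2):
        a, b = columns[left], columns[right]
        for name, op in OPS.items():
            out[f"{left}_{name}_{right}"] = [op(x, y) for x, y in zip(a, b)]
        out[f"{left}_over_{right}"] = [safe_div(x, y) for x, y in zip(a, b)]
    return out


# Toy data with hypothetical column names (not taken from the SBA dataset).
toy = {
    "loan_amount": [100.0, 250.0],
    "guaranteed_amount": [80.0, 0.0],
}
features = synthesize_features(toy)
```

In a workflow like the one the abstract describes, the enlarged feature table would then be passed to the classifier in place of the original columns. Whether a given combination actually helps would have to be checked on held-out data.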