This paper compares data preprocessing methods by their effect on predictive performance for structured data. It also seeks to identify and recommend preprocessing methodologies for tree-based binary classification models, with a focus on eXtreme Gradient Boosting (XGBoost) models. Three synthetic data sets of varying structure, interactions, and complexity were constructed and supplemented by a real-world data set from Lending Club. We compare several methods for feature selection, categorical handling, and null imputation. Performance is assessed through relative comparisons among the chosen methodologies, including the variability of model predictions. The paper is organized by the three groups of preprocessing methodologies, with each section consisting of generalized observations. Each observation is accompanied by a recommendation of one or more preferred methodologies. Among feature selection methods, permutation-based feature importance, regularization, and XGBoost's feature importance by weight are not recommended. Correlation coefficient reduction also shows inferior performance. Instead, XGBoost importance by gain shows the most consistent and highest level of performance. Categorical feature encoding methods show greater differences in performance across data set structures. While there was no universal "best" method, frequency encoding showed the best performance on the most complex data set (Lending Club) but the poorest performance on all synthetic (i.e., simpler) data sets. Finally, missing indicator imputation dominated the other imputation methods in performance, whereas tree imputation showed extremely poor and highly variable model performance.
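To make two of the named techniques concrete, the following is a minimal sketch of frequency encoding and missing indicator imputation in plain Python. It is an illustration only, not the implementation used in the paper: the function names, the example columns (`grades`, `income`), and the fill value of 0.0 are all assumptions introduced here.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional, Sequence


def frequency_encode(values: Sequence[str]) -> list[float]:
    """Replace each category with its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


def missing_indicator_impute(
    values: Sequence[Optional[float]], fill_value: float = 0.0
) -> tuple[list[float], list[int]]:
    """Fill nulls with a constant and return a 0/1 flag column for them.

    The constant fill value is an assumption of this sketch; the flag
    column is what lets a tree-based model split on "was missing".
    """
    filled = [fill_value if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


# Hypothetical categorical column: "A" appears 3 of 6 times, and so on.
grades = ["A", "B", "A", "C", "A", "B"]
encoded_grades = frequency_encode(grades)

# Hypothetical numeric column with two missing entries.
income = [52000.0, None, 61000.0, None]
income_filled, income_missing = missing_indicator_impute(income)
```

In practice the category frequencies would be computed on the training split only and then applied to held-out data, so that the encoding does not leak information from the evaluation set.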