This paper compares the predictive performance of various data preprocessing methods for structured data and seeks to identify and recommend preprocessing methodologies for tree-based binary classification models, with a focus on eXtreme Gradient Boosting (XGBoost) models. Three synthetic data sets of varying structure, interactions, and complexity were constructed and supplemented by a real-world data set from the Lending Club. We compare several methods for feature selection, categorical handling, and null imputation. Performance is assessed through relative comparisons among the chosen methodologies, including model prediction variability. The paper is organized around the three groups of preprocessing methodologies, with each section presenting generalized observations; each observation is accompanied by a recommendation of one or more preferred methodologies. Among feature selection methods, permutation-based feature importance, regularization, and XGBoost's feature importance by weight are not recommended, and correlation-coefficient-based reduction likewise shows inferior performance; in contrast, XGBoost importance by gain shows the most consistent and highest-caliber performance. Categorical feature encoding methods show greater discrimination in performance across data set structures. While there was no universal ``best'' method, frequency encoding performed best on the most complex data set (Lending Club) but worst on all synthetic (i.e., simpler) data sets. Finally, missing-indicator imputation dominated among imputation methods, whereas tree imputation showed extremely poor and highly variable model performance.