Estimating the importance of variables is an essential task in modern machine learning: it helps to evaluate how good a feature is in a given model. Several techniques for estimating variable importance have been developed over the last decade. In this paper, we propose a computational and theoretical exploration of emerging methods of variable importance estimation, namely the Least Absolute Shrinkage and Selection Operator (LASSO), Support Vector Machine (SVM), the Predictive Error Function (PERF), Random Forest (RF), and Extreme Gradient Boosting (XGBOOST), tested on different kinds of real-life and simulated data. All of these methods handle both regression and classification tasks seamlessly, but all of them fail when the data contain missing values. Our experiments show that PERF performs best on highly correlated data, closely followed by RF. PERF and XGBOOST are "data-hungry" methods: they performed worst on small data sizes, but they are the fastest in execution time. SVM is the most appropriate when the dataset contains many redundant features. An added benefit of PERF is its natural cut-off at zero, which separates positive from negative scores: positive scores indicate essential, significant features, while negative scores indicate useless ones. RF and LASSO are versatile in that they can be used in almost all situations, even though they do not give the best results.
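To make the scoring concrete, the following is a minimal sketch, not taken from the paper, of how per-feature importance scores can be extracted for two of the surveyed methods, RF and LASSO. It assumes the scikit-learn implementations and a small simulated regression problem; all parameter values are illustrative.

```python
# Minimal sketch (illustrative, not the paper's experimental setup):
# extracting variable-importance scores with RF and LASSO via scikit-learn.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Simulated regression data: 10 features, only 4 of them informative.
X, y = make_regression(n_samples=500, n_features=10, n_informative=4,
                       noise=0.5, random_state=0)

# Random Forest: impurity-based importance, one non-negative score per feature.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
rf_importance = rf.feature_importances_

# LASSO: the magnitude of a coefficient shrunk toward zero serves as an
# importance score; exact zeros mark features the penalty has discarded.
lasso = Lasso(alpha=0.1).fit(X, y)
lasso_importance = np.abs(lasso.coef_)

for i, (ri, li) in enumerate(zip(rf_importance, lasso_importance)):
    print(f"feature {i}: RF={ri:.3f}  LASSO={li:.3f}")
```

Note that both scores are non-negative by construction, which is what distinguishes them from PERF's signed scores and its natural cut-off at zero described above.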