Tree-based embedded feature selection can exclude irrelevant features from high-dimensional data with very small sample sizes, but it requires optimized hyperparameters for model building. In addition, nested cross-validation must be applied to this type of data to avoid biased estimates of model performance. The resulting long computation time can be reduced with pruning. However, because the performance evaluation metric has high variance, standard pruning algorithms must either prune late or risk aborting the evaluation of promising hyperparameter sets. To address this, we adapt the usage of a state-of-the-art successive halving pruner and combine it with two new pruning strategies based on domain or prior knowledge. The first immediately stops trials whose selected hyperparameter combination yields semantically meaningless results. The second is an extrapolating threshold pruning strategy suited to nested cross-validation with high variance. Our combined three-layer pruner keeps promising trials while reducing the number of models to be built by up to 81.3% compared to using a state-of-the-art asynchronous successive halving pruner alone. Our three-layer pruner implementation (available at https://github.com/sigrun-may/cv-pruner) speeds up data analysis or enables a deeper hyperparameter search within the same computation time. It consequently saves time, money, and energy, reducing the CO2 footprint.
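The two knowledge-based pruning layers described above can be illustrated with a minimal sketch. It is not the cv-pruner API: the function names, the choice of "no features selected" as the example of a semantically meaningless result, the assumption that the metric is minimized, and the optimistic extrapolation rule (remaining folds are assumed to match the best fold seen so far) are all hypothetical choices made for illustration. The third layer, successive halving, is left to an existing pruner and is not reproduced here.

```python
def optimistic_final_mean(fold_scores, total_folds):
    """Extrapolate the final mean of a metric that is minimized.

    Assumption (illustrative only): every fold that has not been
    evaluated yet will match the best score observed so far.
    """
    remaining = total_folds - len(fold_scores)
    best_so_far = min(fold_scores)
    return (sum(fold_scores) + remaining * best_so_far) / total_folds


def should_prune(fold_scores, total_folds, threshold, n_selected_features):
    """Return True if a trial should be stopped early.

    Layer 1 (hypothetical criterion): a model that selected no features
    is treated as semantically meaningless and is pruned immediately.
    Layer 2: prune if even the optimistic extrapolation of the final
    mean is worse (larger) than the user-defined threshold.
    """
    if n_selected_features == 0:
        return True
    if not fold_scores:
        # Nothing to extrapolate from yet, so keep the trial running.
        return False
    return optimistic_final_mean(fold_scores, total_folds) > threshold


if __name__ == "__main__":
    # Two of four folds evaluated; threshold chosen from prior knowledge.
    print(should_prune([0.9, 0.7], total_folds=4, threshold=0.6,
                       n_selected_features=5))
```

Because the extrapolation is deliberately optimistic, a trial is only stopped when it could not reach the threshold even under favorable remaining folds, which is one way to prune early without discarding promising trials on a single noisy fold.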