Random forests are a widely used machine learning algorithm, but their computational efficiency suffers when they are applied to large-scale datasets with numerous instances and many useless features. Herein, we propose a nonparametric feature selection algorithm that combines random forests with deep neural networks, and we investigate its theoretical properties under regularity conditions. Using several synthetic models and a real-world example, we demonstrate the advantage of the proposed algorithm over alternatives in identifying useful features, avoiding useless ones, and computational efficiency. Although the algorithm is presented with standard random forests, it can be readily adapted to other machine learning algorithms, as long as features can be sorted accordingly.
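To make the general idea concrete, the following is a minimal sketch of the kind of pipeline the abstract describes: features are first sorted by a random-forest importance measure, and a small neural network is then used to screen growing prefixes of the ranking. This is only an illustration under assumed choices (scikit-learn models, an impurity-based importance, a cross-validation stopping tolerance), not the paper's actual algorithm or its theoretical procedure.

```python
# Illustrative sketch only: rank features with a random forest, then screen the
# ranked list with a small neural network. All model choices, subset sizes, and
# tolerances below are hypothetical and chosen for demonstration.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: 10 informative features hidden among 100.
X, y = make_regression(n_samples=2000, n_features=100, n_informative=10,
                       noise=1.0, random_state=0)

# Step 1: sort features by random-forest (impurity-based) importance.
rf = RandomForestRegressor(n_estimators=200, random_state=0, n_jobs=-1)
rf.fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]

# Step 2: grow the candidate set along the ranking and keep the prefix size
# with the best cross-validated neural-network fit (up to a small tolerance).
best_score, best_k = -np.inf, 0
for k in range(5, 55, 5):
    subset = order[:k]
    nn = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
    score = cross_val_score(nn, X[:, subset], y, cv=3).mean()
    if score > best_score + 1e-3:
        best_score, best_k = score, k

selected = order[:best_k]
print(f"selected {best_k} features: {sorted(selected.tolist())}")
```

Because only the sorted feature list enters the screening step, the random forest could in principle be swapped for any learner that yields a feature ranking, which mirrors the adaptability claim in the abstract.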