Screening feature selection methods are often used as a preprocessing step to reduce the number of variables before model training. Traditional screening methods focus only on complete high-dimensional datasets. Modern datasets not only have higher dimension and larger sample size, but also exhibit properties such as streaming input, sparsity, and concept drift. A considerable number of online feature selection methods have therefore been introduced in recent years to handle these kinds of problems, and online screening methods form one category of them. The methods we propose in this research can handle all three of the situations mentioned above. Our study focuses on classification datasets. Our experiments show that the proposed methods generate the same feature importance as their offline versions while running faster and consuming less storage. Furthermore, on data streams with concept drift, online screening methods with integrated model adaptation achieve a higher true feature detection rate than those without model adaptation. On two large real datasets that potentially exhibit concept drift, online screening methods with model adaptation show advantages in saving computing time and space, reducing model complexity, or improving prediction accuracy.
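To illustrate the claim that an online screening method can reproduce the feature importance of its offline counterpart, here is a minimal sketch. It assumes the Fisher score as the screening statistic, which is chosen only for illustration because the abstract does not name the statistics used. The class `OnlineFisherScore` and the function `offline_fisher_score` are hypothetical names, and the sketch does not cover sparsity handling or model adaptation for concept drift.

```python
from collections import defaultdict


class OnlineFisherScore:
    """Streaming Fisher score per feature for a classification stream.

    Illustrative assumption: the screening statistic is the Fisher score.
    Only per-class counts, sums, and sums of squares are stored, so memory
    is O(classes * features) no matter how many samples arrive.
    Note: sums of squares can lose precision on large-magnitude data;
    a Welford-style update is more stable.
    """

    def __init__(self, n_features):
        self.n_features = n_features
        self.count = defaultdict(int)
        self.total = defaultdict(lambda: [0.0] * n_features)
        self.total_sq = defaultdict(lambda: [0.0] * n_features)

    def update(self, x, y):
        """Consume one sample (feature list x, class label y)."""
        self.count[y] += 1
        s, q = self.total[y], self.total_sq[y]
        for j, v in enumerate(x):
            s[j] += v
            q[j] += v * v

    def scores(self):
        """Return between-class / within-class scatter for each feature."""
        n = sum(self.count.values())
        if n == 0:
            return [0.0] * self.n_features
        out = []
        for j in range(self.n_features):
            mean = sum(self.total[c][j] for c in self.count) / n
            between = within = 0.0
            for c, n_c in self.count.items():
                mean_c = self.total[c][j] / n_c
                between += n_c * (mean_c - mean) ** 2
                within += self.total_sq[c][j] - n_c * mean_c ** 2
            out.append(between / within if within > 0 else 0.0)
        return out


def offline_fisher_score(X, y):
    """Two-pass reference computation over the full dataset."""
    n, d = len(X), len(X[0])
    classes = sorted(set(y))
    out = []
    for j in range(d):
        col = [row[j] for row in X]
        mean = sum(col) / n
        between = within = 0.0
        for c in classes:
            vals = [v for v, label in zip(col, y) if label == c]
            mean_c = sum(vals) / len(vals)
            between += len(vals) * (mean_c - mean) ** 2
            within += sum((v - mean_c) ** 2 for v in vals)
        out.append(between / within if within > 0 else 0.0)
    return out
```

Calling `update` once per arriving sample and then `scores()` should give the same ranking as `offline_fisher_score` on the full data, up to floating-point rounding, without ever storing the samples themselves.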