Feature or variable selection is a problem inherent to large data sets. While many methods have been proposed to address it, some scale poorly with the number of predictors. Screening methods scale linearly in the number of predictors by checking each predictor one at a time, and serve as a tool for reducing the number of variables to consider before further analysis or variable selection. For classification, a variety of techniques exist: parametric screening tests, such as t-test or SIS-based screening, and non-parametric screening tests, such as Kolmogorov distance-based screening and MV-SIS. We propose a method for variable screening that uses Bayesian-motivated tests, compare it to SIS-based screening, and provide example applications of the method on simulated and real data. We show that our screening method can lead to improvements in classification rate, even when used in conjunction with a classifier, such as DART, that is designed to select a sparse subset of variables. Finally, we propose a classifier based on kernel density estimates that in some cases can produce dramatic improvements in classification rates relative to DART.
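To illustrate why screening scales linearly, the sketch below implements a generic marginal screening step for binary classification: each predictor is scored independently with a two-sample t statistic and only the highest-scoring columns are kept. This is a minimal illustration of t-test screening as mentioned above, not the paper's Bayesian-motivated test; the function name and the choice of keeping a fixed number of predictors are assumptions for the example.

```python
import numpy as np

def t_screen(X, y, keep):
    """Marginal screening for binary classification: score each predictor
    with a two-sample t statistic and keep the `keep` strongest.
    One pass per column, so cost grows linearly in the number of predictors.
    (Illustrative t-test screening, not the paper's Bayesian test.)"""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    # Welch-style standard error, computed column-wise in one vectorized pass
    se = np.sqrt(X0.var(axis=0, ddof=1) / n0 + X1.var(axis=0, ddof=1) / n1)
    t = (X1.mean(axis=0) - X0.mean(axis=0)) / se
    # rank predictors by |t| and return the indices of the strongest
    return np.argsort(-np.abs(t))[:keep]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 50))
X[:, 3] += 2.0 * y            # plant one strongly informative column
kept = t_screen(X, y, keep=5)
```

With a two-standard-deviation mean shift, the planted column 3 survives the screen, while most noise columns are discarded before any downstream variable selection runs.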
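The abstract does not specify the form of the proposed kernel density classifier, so the sketch below shows one common construction it could resemble: a naive-Bayes style classifier that fits a univariate Gaussian KDE per class and per feature, then classifies by maximum log-posterior. The class name and the independence assumption across features are illustrative assumptions, not the paper's method.

```python
import numpy as np
from scipy.stats import gaussian_kde

class KDEClassifier:
    """Naive-Bayes style classifier with one univariate Gaussian KDE per
    (class, feature) pair. A generic sketch only; the paper's own
    KDE-based classifier may differ."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        # fit a separate 1-D density estimate for each class and feature
        self.kdes_ = {c: [gaussian_kde(X[y == c, j])
                          for j in range(X.shape[1])]
                      for c in self.classes_}
        return self

    def predict(self, X):
        # log-posterior up to a constant: log prior + sum of log densities
        scores = np.column_stack([
            np.log(self.priors_[c])
            + sum(np.log(k(X[:, j]) + 1e-300)   # guard against log(0)
                  for j, k in enumerate(self.kdes_[c]))
            for c in self.classes_])
        return self.classes_[scores.argmax(axis=1)]

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 2))
X[y == 1] += 3.0                  # well-separated classes
clf = KDEClassifier().fit(X, y)
acc = np.mean(clf.predict(X) == y)
```

Because the class-conditional densities are estimated non-parametrically, this kind of classifier can adapt to skewed or multimodal feature distributions where a Gaussian assumption would fail.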