Feature selection is a crucial step in developing robust and powerful machine learning models. Feature selection techniques can be divided into two categories: filter and wrapper methods. Wrapper methods commonly achieve strong predictive performance, but their high computational complexity makes them slow, especially on high-dimensional feature sets. Filter methods are considerably faster, but suffer from several other disadvantages: they (i) require a threshold value, (ii) do not take into account intercorrelation between features, and (iii) ignore feature interactions with the model. To address these shortcomings, we present powershap, a novel wrapper feature selection method that leverages statistical hypothesis testing and power calculations in combination with Shapley values for quick and intuitive feature selection. Powershap is built on the core assumption that an informative feature will have a larger impact on the prediction than a known random feature. Benchmarks and simulations show that powershap outperforms filter methods and achieves predictive performance on par with wrapper methods while being significantly faster, often needing only half or even a third of the execution time. As such, powershap provides a competitive and quick algorithm that can be used with various models in different domains. Furthermore, powershap is implemented as a plug-and-play, open-source sklearn component, enabling easy integration into conventional data science pipelines. An automatic mode further improves the user experience by tuning the hyper-parameters of the powershap algorithm, so that the algorithm can be used without any configuration.
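The core assumption (an informative feature should have a larger impact than a known random feature) can be illustrated with a minimal sketch. This is not the powershap implementation: it assumes a least-squares linear model, for which the Shapley value of feature j on a sample x is coef_j * (x_j - mean_j) when features are treated as independent, and it replaces the hypothesis test and power calculation with a simple empirical p-value. The names `linear_shap_impact` and `select_features`, and the parameters `iterations` and `alpha`, are hypothetical.

```python
import numpy as np


def linear_shap_impact(X_train, y_train, X_val):
    """Mean absolute Shapley value per feature for a least-squares linear model.

    Assumption: features are treated as independent, so the Shapley value of
    feature j on sample x is coef_j * (x_j - mean_j).
    """
    mean = X_train.mean(axis=0)
    # Fit y ~ centered X plus an intercept column; drop the intercept afterwards.
    design = np.column_stack([X_train - mean, np.ones(len(X_train))])
    coef = np.linalg.lstsq(design, y_train, rcond=None)[0][:-1]
    return np.abs(coef * (X_val - mean)).mean(axis=0)


def select_features(X, y, iterations=50, alpha=0.01, seed=0):
    """Keep features whose impact clearly exceeds that of a random feature."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    impacts = np.empty((iterations, d + 1))
    for i in range(iterations):
        # Append a fresh, known-random feature as the last column.
        X_aug = np.column_stack([X, rng.uniform(-1.0, 1.0, size=n)])
        # Use a new random train/validation split in every iteration.
        perm = rng.permutation(n)
        cut = int(0.8 * n)
        train, val = perm[:cut], perm[cut:]
        impacts[i] = linear_shap_impact(X_aug[train], y[train], X_aug[val])
    random_impacts = impacts[:, -1]
    # Empirical p-value (a stand-in for the statistical test): share of
    # iterations in which the random feature's impact is at least as large
    # as the candidate feature's average impact.
    p_values = np.array(
        [(random_impacts >= impacts[:, j].mean()).mean() for j in range(d)]
    )
    return p_values < alpha, p_values


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(500, 4))
    # Only the first two columns carry signal; the last two are pure noise.
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)
    selected, p_values = select_features(X, y)
    print("selected:", selected)
    print("p-values:", p_values)
```

Comparing every feature against a feature that is random by construction gives a data-driven reference level, which is why no manually chosen importance threshold is needed in this sketch.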