Feature selection is the problem of choosing a subset of features for a machine learning model that maximizes model quality subject to a resource budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and stochastic gates, typically select all of the features in a single evaluation round. They therefore ignore the residual value of features during selection, i.e., the marginal contribution of a feature conditioned on the features already selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. The algorithm is based on an efficient implementation of greedy forward selection and uses the attention weights at each step as a proxy for marginal feature importance. We provide theoretical insight into Sequential Attention for linear regression models by showing that an adaptation of the algorithm to this setting is equivalent to the classical Orthogonal Matching Pursuit algorithm [PRK1993], and thus inherits all of its provable guarantees. Lastly, our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
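To make the linear regression connection concrete, the following is a minimal sketch of classical Orthogonal Matching Pursuit used as greedy forward selection. It is not the Sequential Attention implementation: the function name `omp_select`, the column normalization, and the synthetic data are illustrative assumptions. At each step, the feature most correlated with the current residual is added, and the residual is recomputed by refitting least squares on all features selected so far, so each score reflects a feature's marginal contribution given the earlier choices.

```python
import numpy as np


def omp_select(X, y, k):
    """Select up to k columns of X by Orthogonal Matching Pursuit.

    Hypothetical helper for illustration only.
    X: (n, d) design matrix, y: (n,) targets.
    Returns the selected column indices in the order they were chosen.
    """
    n, d = X.shape
    # Assumption: scale columns to unit norm so residual correlations
    # are comparable across features.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0
    Xn = X / norms

    selected = []
    residual = np.asarray(y, dtype=float).copy()
    for _ in range(min(k, d)):
        # Score every feature by |correlation with the current residual|.
        scores = np.abs(Xn.T @ residual)
        if selected:
            scores[selected] = -np.inf  # never pick the same feature twice
        j = int(np.argmax(scores))
        selected.append(j)
        # Refit least squares on all selected features, then update the
        # residual so the next score is conditioned on earlier choices.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected


# Synthetic check: y depends only on columns 2 and 5.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 5]
print(omp_select(X_demo, y_demo, k=2))
```

On this noiseless synthetic problem, the two selected indices should be the informative columns. A one-shot method would instead rank all columns by a single score and take the top `k`, without recomputing the residual between picks.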