Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations towards the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
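As a rough illustration of the mechanism described above, the following sketch runs greedy forward selection on a plain linear model: in each round, a softmax attention mask over the not-yet-selected features is learned jointly with the model, and the feature with the largest attention weight is added to the selected set. This is a hedged sketch of the idea only, not the paper's one-pass implementation; the function name, hyperparameters, and the choice of a linear model trained by gradient descent are illustrative assumptions.

```python
# Minimal sketch (assumed details, not the paper's one-pass algorithm):
# greedy forward selection where per-round attention weights serve as a
# proxy for feature importance.
import numpy as np

def sequential_attention_sketch(X, y, k, steps=500, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    selected = []
    for _ in range(k):
        remaining = [j for j in range(d) if j not in selected]
        logits = np.zeros(len(remaining))      # attention logits over candidate features
        w = rng.normal(scale=0.01, size=d)     # linear model weights
        for _ in range(steps):
            a = np.exp(logits - logits.max())
            a /= a.sum()                       # softmax attention weights
            mask = np.zeros(d)
            mask[selected] = 1.0               # already-selected features pass through
            mask[remaining] = a                # candidates are gated by attention
            pred = (X * mask) @ w
            resid = pred - y
            # gradients of 0.5 * mean squared error w.r.t. w and the mask
            grad_w = (X * mask).T @ resid / n
            grad_mask = (X * w).T @ resid / n
            grad_a = grad_mask[remaining]
            grad_logits = a * (grad_a - np.dot(a, grad_a))  # softmax backprop
            w -= lr * grad_w
            logits -= lr * grad_logits
        # attention weight as a proxy for the residual (marginal) value of a feature
        selected.append(remaining[int(np.argmax(logits))])
    return selected
```

In the algorithm proposed in the paper, this per-round retraining is amortized into a single training pass, and for linear regression the abstract states that an adaptation of this attention-based selection rule is equivalent to Orthogonal Matching Pursuit.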