Variable selection is commonly used to arrive at a parsimonious model when relating an outcome to high-dimensional covariates. Because of inherent structural constraints among the candidate variables, it is often desirable to impose a "selection rule" that prescribes which combinations of variables are permissible in the final model. Penalized regression methods can incorporate such selection rules by assigning the covariates to different (possibly overlapping) groups and then applying different penalties to the groups of variables. However, no general framework has yet been proposed to formalize selection rules and their application. In this work, we develop a mathematical language for constructing selection rules in variable selection, in which the resulting collection of permissible sets of selected covariates, called a "selection dictionary", is formally defined. We show that every selection rule can be represented as a combination of operations on basic constructs, which we refer to as "unit rules", and that this representation can be used to identify the corresponding selection dictionary. One may then apply a model selection criterion to choose the best model among the sets in the dictionary. We also present a necessary and sufficient condition for a grouping structure used with the (latent) overlapping group Lasso to carry out variable selection under an arbitrary selection rule.
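To make the idea of a selection dictionary concrete, the following is a minimal Python sketch, not the formalism of the paper itself. It assumes a unit rule has the form "the number of selected variables from a given subset must lie in a given set of counts" and that rules can be combined with an "if ... then ..." operation; the function names `power_set`, `unit_rule`, and `implies` are hypothetical and chosen only for illustration. The example encodes strong heredity for two main effects `A`, `B` and their interaction `AB`.

```python
from itertools import combinations


def power_set(variables):
    """Return all subsets of `variables` as a list of frozensets."""
    items = sorted(variables)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def unit_rule(variables, subset, allowed_counts):
    """Dictionary of an (assumed) unit rule: keep every candidate set whose
    number of selected variables from `subset` is in `allowed_counts`."""
    subset = frozenset(subset)
    return {s for s in power_set(variables) if len(s & subset) in allowed_counts}


def implies(variables, dict_a, dict_b):
    """Dictionary of 'if rule A holds then rule B holds':
    candidate sets that either violate A or satisfy B."""
    everything = set(power_set(variables))
    return (everything - dict_a) | dict_b


# Candidate variables: two main effects and their interaction (illustrative).
V = {"A", "B", "AB"}

# Unit rule 1: the interaction AB is selected.
select_interaction = unit_rule(V, {"AB"}, {1})
# Unit rule 2: both main effects are selected.
select_both_main = unit_rule(V, {"A", "B"}, {2})

# Strong heredity: if AB is selected, then A and B must both be selected.
strong_heredity = implies(V, select_interaction, select_both_main)

for s in sorted(strong_heredity, key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
```

Under these assumptions, the resulting dictionary contains the empty set, the two single main effects, the pair of main effects, and the full set, while any set containing `AB` without both `A` and `B` is excluded. A model selection criterion would then be applied only to the sets in this dictionary. Enumerating the power set is feasible only for a handful of variables; the sketch is meant to illustrate the definitions, not to scale.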