In high-dimensional statistics, variable selection recovers the latent sparse pattern from all possible covariate combinations. This paper proposes a novel optimization method for the exact L0-regularized regression problem, also known as best subset selection. We reformulate the optimization problem from a discrete space to a continuous one via probabilistic reparameterization. The new objective function is differentiable, but its gradient often cannot be computed in closed form. We therefore propose a family of unbiased gradient estimators that optimize the best subset selection objective by stochastic gradient descent. Within this family, we identify the estimator with uniformly minimum variance. Theoretically, we study the general conditions under which the method is guaranteed to converge to the ground truth in expectation. The proposed method can recover the true regression model from thousands of covariates in seconds. On a wide variety of synthetic and semi-synthetic datasets, it outperforms existing variable selection tools based on relaxed penalties, coordinate descent, and mixed-integer optimization, in both sparse-pattern recovery and out-of-sample prediction.
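To make the reformulation concrete, the following is a minimal, self-contained sketch of the general approach in NumPy. It relaxes the discrete subset indicators into Bernoulli variables and optimizes their logits with a score-function (REINFORCE) gradient estimator plus a leave-one-out baseline, which is one simple member of the family of unbiased estimators, not the paper's minimum-variance construction. The toy data, the penalty strength, the learning rates, the Monte Carlo sample count, and the ridge warm start are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n samples, p covariates, first k coefficients nonzero.
n, p, k = 200, 50, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:k] = 3.0
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Objective: E_s[ ||y - X(s*beta)||^2 / n + lam * |s|_0 ],
# with s_j ~ Bernoulli(sigmoid(logit_j)).
lam = 0.1                     # L0 penalty strength (assumed)
logits = np.zeros(p)          # inclusion logits, pi_j = sigmoid(logit_j)
beta = np.linalg.solve(X.T @ X + np.eye(p), X.T @ y)  # ridge warm start
lr_logit, lr_beta, m = 0.05, 0.05, 16                 # assumed hyperparameters

for step in range(3000):
    pi = 1.0 / (1.0 + np.exp(-logits))
    S = (rng.random((m, p)) < pi).astype(float)   # m sampled subsets
    R = y[None, :] - (S * beta) @ X.T             # residuals, shape (m, n)
    f = (R * R).sum(axis=1) / n + lam * S.sum(axis=1)
    # Leave-one-out baseline: reduces variance without introducing bias,
    # since each sample's baseline is independent of that sample.
    baseline = (f.sum() - f) / (m - 1)
    # Score function: d log q(s) / d logit_j = s_j - pi_j for Bernoulli.
    grad_logits = ((f - baseline)[:, None] * (S - pi)).mean(axis=0)
    # Pathwise gradient of the MSE with respect to beta, per sampled subset.
    grad_beta = (-2.0 / n) * (S * (R @ X)).mean(axis=0)
    logits -= lr_logit * grad_logits
    beta -= lr_beta * grad_beta

selected = np.flatnonzero(1.0 / (1.0 + np.exp(-logits)) > 0.5)
print("selected covariates:", selected)  # ideally 0..k-1
```

The score-function estimator is unbiased for the gradient of the expected objective with respect to the logits, but its raw variance is high; the baseline is a standard variance-reduction device, whereas the paper's contribution is to identify the uniformly minimum-variance estimator within the unbiased family.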