The best subset selection (or "best subsets") estimator is a classic tool for sparse regression, and developments in mathematical optimization over the past decade have made it more computationally tractable than ever. Notwithstanding its desirable statistical properties, the best subsets estimator is susceptible to outliers and can break down in the presence of a single contaminated data point. To address this issue, a robust adaption of best subsets is proposed that is highly resistant to contamination in both the response and the predictors. The adapted estimator generalizes the notion of subset selection to both predictors and observations, thereby achieving robustness in addition to sparsity. This procedure, referred to as "robust subset selection" (or "robust subsets"), is defined by a combinatorial optimization problem for which modern discrete optimization methods are applied. The robustness of the estimator in terms of the finite-sample breakdown point of its objective value is formally established. In support of this result, experiments on synthetic and real data are reported that demonstrate the superiority of robust subsets over best subsets in the presence of contamination. Importantly, robust subsets fares competitively across several metrics compared with popular robust adaptions of continuous shrinkage estimators.
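To make the idea of selecting subsets of both predictors and observations concrete, here is a minimal sketch. The abstract does not state the optimization problem, so the formulation is an assumption: choose at most k predictors and at least h observations so that the residual sum of squares over the retained observations is minimized. The function name `robust_subsets_bruteforce` and the parameters `k` and `h` are hypothetical. The search is plain exhaustive enumeration, which is feasible only for tiny problems and is not the discrete optimization approach the abstract refers to.

```python
from itertools import combinations

import numpy as np


def robust_subsets_bruteforce(X, y, k, h):
    """Exhaustively pick <= k predictors and h observations minimizing the RSS.

    Assumed formulation (not taken from the abstract). Retaining exactly h
    observations is enough, because adding observations never lowers the RSS.
    """
    n, p = X.shape
    best_rss, best_beta, best_cols, best_rows = np.inf, None, None, None
    for rows in combinations(range(n), h):
        rows = list(rows)
        for size in range(k + 1):
            for cols in combinations(range(p), size):
                cols = list(cols)
                beta = np.zeros(p)
                if cols:
                    # Least squares fit on the retained rows and columns only.
                    X_sub = X[np.ix_(rows, cols)]
                    coef, *_ = np.linalg.lstsq(X_sub, y[rows], rcond=None)
                    beta[cols] = coef
                resid = y[rows] - X[rows] @ beta
                rss = float(resid @ resid)
                if rss < best_rss:
                    best_rss, best_beta = rss, beta
                    best_cols, best_rows = cols, rows
    return best_rss, best_beta, best_cols, best_rows


# Toy data: the response depends on the first predictor only,
# and observation 3 is contaminated in the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = 2.0 * X[:, 0]
y[3] = 100.0

rss, beta, cols, rows = robust_subsets_bruteforce(X, y, k=1, h=7)
print("selected predictors:", cols)
print("retained observations:", rows)
print("coefficients:", beta)
```

On this toy data the contaminated observation is the one left out, so the fit on the remaining rows recovers the clean coefficient. Setting `h` equal to the number of observations reduces the search to ordinary best subsets under the same assumed formulation, which is the sense in which the robust estimator generalizes it.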