Least squares (LS)-based subset selection methods are popular in linear regression modeling. Best subset selection (BS) is known to be NP-hard, and its computational cost grows exponentially with the number of predictors. Recently, Bertsimas et al. (2016) formulated BS as a mixed integer optimization (MIO) problem and substantially reduced its computational overhead by using a well-developed optimization solver, but the methodology is still not scalable to very large datasets. In this paper, we propose a novel LS-based method, the best orthogonalized subset selection (BOSS) method, which performs BS upon an orthogonalized basis of ordered predictors and scales easily to large problem sizes. Another challenge in applying LS-based methods in practice is the selection rule used to choose the optimal subset size k. Cross-validation (CV) requires fitting a procedure multiple times and results in a selected k that is random across repeated applications to the same dataset. Compared to CV, information criteria only require fitting a procedure once, but they require knowledge of the effective degrees of freedom of the fitting procedure, which is generally not available analytically for complex methods. Since BOSS uses orthogonalized predictors, we first explore, for orthogonal non-random predictors, a connection between BS and its Lagrangian formulation (i.e., minimization of the residual sum of squares plus the product of a regularization parameter and k), and based on this connection we propose a heuristic degrees of freedom (hdf) for BOSS that can be estimated via an analytically based expression. We show in both simulations and real data analyses that BOSS, combined with a proposed Kullback-Leibler-based information criterion AICc-hdf, has the strongest performance of all of the LS-based methods considered and is competitive with regularization methods, all with the computational effort of a single ordinary LS fit.
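The scalability claim rests on a classical fact: on an orthonormal basis, the best size-k subset is simply the k components with the largest-magnitude LS coefficients, so the whole solution path costs one QR decomposition. The Python sketch below illustrates this reduction only; it is not the authors' implementation. The function name boss_sketch is ours, the paper's predictor-ordering step is assumed to have already fixed the column order of X, and X is assumed to have full column rank with n >= p.

```python
import numpy as np

def boss_sketch(X, y):
    """Illustrative sketch: best subset selection on an orthogonalized
    basis of the (already ordered) predictors. With orthonormal columns Q,
    the residual sum of squares of a subset S is ||y||^2 - sum_{j in S} z_j^2,
    so the best size-k subset keeps the k largest |z_j|."""
    n, p = X.shape
    Q, R = np.linalg.qr(X)            # orthonormal basis of the ordered predictors
    z = Q.T @ y                       # LS coefficients on the orthogonal basis
    order = np.argsort(-np.abs(z))    # rank components by coefficient magnitude
    fits = []
    for k in range(p + 1):
        gamma = np.zeros(p)
        keep = order[:k]
        gamma[keep] = z[keep]         # zero out all but the k largest components
        rss = np.sum((y - Q @ gamma) ** 2)
        fits.append((k, rss, gamma))  # one candidate fit per subset size k
    return fits
```

In this simplified setting, a selection rule such as CV or an information criterion would then be applied to the p + 1 candidate fits to choose k; the paper's AICc-hdf criterion plays that role for BOSS.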