We consider the problem of best subset selection in linear regression, where the goal is to find, for every model size $k$, the subset of $k$ features that best fits the response. This is particularly challenging when the number of available features is very large compared to the number of data samples. We propose COMBSS, a novel continuous-optimization-based method that directly solves the best subset selection problem in linear regression. COMBSS turns out to be very fast, potentially making best subset selection feasible when the number of features is well in excess of thousands. Simulation results are presented to highlight the performance of COMBSS in comparison to existing popular non-exhaustive methods such as Forward Stepwise and the Lasso, as well as to exhaustive methods such as Mixed-Integer Optimization. Given its outstanding overall performance, framing the best subset selection challenge as a continuous optimization problem opens new research directions for feature extraction for a large variety of regression models.
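For concreteness, the best subset selection problem referenced above can be stated as the constrained least-squares program below. The notation ($y$, $X$, $\beta$, $\|\cdot\|_0$) is standard for this problem but is not fixed by the abstract itself, so it should be read as an illustrative formulation:
\[
\min_{\beta \in \mathbb{R}^p} \; \|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_0 \le k,
\]
where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix, and $\|\beta\|_0$ counts the nonzero entries of $\beta$. Solving this program for each $k$ yields the best-fitting subset of every size; the combinatorial nature of the $\ell_0$ constraint, with $\binom{p}{k}$ candidate subsets, is what makes the problem hard when $p$ greatly exceeds $n$.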