We consider the problem of best subset selection in linear regression, where the goal is to find, for every model size $k$, the subset of $k$ features that best fits the response. This is particularly challenging when the number of available features is very large compared to the number of data samples. We propose COMBSS, a novel continuous-optimization-based method that identifies a solution path: a small set of models of varying size that are candidates for the best subset in linear regression. COMBSS turns out to be very fast, making subset selection possible even when the number of features is well in excess of thousands. Simulation results are presented to highlight the performance of COMBSS in comparison to popular existing methods such as Forward Stepwise, the Lasso, and Mixed-Integer Optimization. Given its outstanding overall performance, framing the best subset selection challenge as a continuous optimization problem opens new research directions for feature extraction in a large variety of regression models.
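To make the combinatorial difficulty concrete, the sketch below (an illustration of the exhaustive baseline, not the COMBSS algorithm) finds the best subset of a fixed size $k$ by enumerating all $\binom{p}{k}$ candidate subsets and keeping the one with the smallest residual sum of squares; the function name and interface are hypothetical.

    # A minimal sketch of exhaustive best subset selection for a fixed k.
    # This is the brute-force baseline, not the COMBSS method: its
    # O(p choose k) cost is exactly what makes the problem hard when the
    # number of features p is large.
    from itertools import combinations
    import numpy as np

    def best_subset_exhaustive(X, y, k):
        """Return the index set of size k minimizing least-squares RSS."""
        n, p = X.shape
        best_rss, best_subset = np.inf, None
        for subset in combinations(range(p), k):
            Xs = X[:, subset]
            # Least-squares fit on the candidate subset of columns.
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_subset = rss, subset
        return best_subset, best_rss

Even for moderate sizes such as $p = 1000$ and $k = 10$, the number of subsets exceeds $10^{23}$, which is why continuous relaxations of the problem, as pursued by COMBSS, are attractive.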