We present fast classification techniques for sparse generalized linear and additive models. These techniques can handle thousands of features and thousands of observations in minutes, even in the presence of many highly correlated features. For fast sparse logistic regression, our computational speed-up over other best-subset search techniques owes to linear and quadratic surrogate cuts for the logistic loss that allow us to efficiently screen features for elimination, as well as use of a priority queue that favors a more uniform exploration of features. As an alternative to the logistic loss, we propose the exponential loss, which permits an analytical solution to the line search at each iteration. Our algorithms are generally 2 to 5 times faster than previous approaches. They produce interpretable models that have accuracy comparable to black box models on challenging datasets.
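The analytical line search under the exponential loss can be sketched as follows. For a model with additive score f and a binary feature, a coordinate-descent step minimizing sum_i exp(-y_i f_i) over the step size has an AdaBoost-style closed form: the optimal step is half the log of the ratio of correctly to incorrectly classified sample weights among the active points. This is a minimal illustration assuming {0,1}-valued features and labels y in {-1,+1}; the function name `exp_loss_coordinate_step` is hypothetical, not the paper's exact algorithm.

```python
import numpy as np

def exp_loss_coordinate_step(X, y, f, j):
    """One coordinate-descent step on feature j under the exponential
    loss sum_i exp(-y_i * f_i).

    Assumes X[:, j] takes values in {0, 1} and y in {-1, +1}; in that
    case only the active samples (x_ij = 1) affect the step size, and
    setting the derivative of the loss to zero gives the closed form
    alpha* = 0.5 * ln(W_pos / W_neg), so no iterative line search is
    needed.  (Illustrative sketch, not the paper's implementation.)
    """
    w = np.exp(-y * f)                    # current per-sample weights
    active = X[:, j] == 1
    W_pos = w[active & (y == 1)].sum()    # weight of active positives
    W_neg = w[active & (y == -1)].sum()   # weight of active negatives
    alpha = 0.5 * np.log(W_pos / W_neg)   # analytical line-search solution
    f_new = f + alpha * X[:, j]           # updated additive scores
    return alpha, f_new
```

Because the step is available in closed form, each coordinate update costs one pass over the active samples, with no inner loop of the kind the logistic loss would require.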