The Frank-Wolfe method has become increasingly useful in statistical and machine learning applications due to the structure-inducing properties of its iterates, and especially in settings where linear minimization over the feasible set is computationally cheaper than projection. In the setting of Empirical Risk Minimization -- one of the fundamental optimization problems in statistics and machine learning -- the computational cost of Frank-Wolfe methods typically grows linearly in the number of data observations $n$. This is in stark contrast to the case for typical stochastic projection methods. In order to reduce this dependence on $n$, we exploit the second-order smoothness of typical loss functions (the least squares loss and the logistic loss, for example) and propose amending the Frank-Wolfe method with Taylor series-approximated gradients, including variants for both deterministic and stochastic settings. Compared with current state-of-the-art methods, in the regime where the optimality tolerance $\varepsilon$ is sufficiently small our methods reduce the dependence on large $n$ while attaining the optimal convergence rates of Frank-Wolfe methods, in both the convex and non-convex settings. We also propose a novel adaptive step-size approach for which we have computational guarantees. Finally, we present computational experiments showing that our methods exhibit significant speed-ups over existing methods on real-world datasets, for both convex and non-convex binary classification problems.
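To make the idea concrete, the following is a minimal sketch, not the authors' exact algorithm or its guarantees, of a Frank-Wolfe loop that uses a Taylor-series-approximated gradient for logistic-loss ERM over an $\ell_1$-ball. The function names (`taylor_fw`, `lmo_l1_ball`), the anchor-refresh schedule `refresh_every`, and the ball radius are illustrative assumptions. The mechanism sketched is this: between occasional anchor refreshes, each iteration evaluates the first-order expansion $\nabla f(x_0) + \nabla^2 f(x_0)(x - x_0)$ at $O(p^2)$ cost, never touching the $n$ data rows, whereas an exact gradient costs $O(np)$.

```python
import numpy as np

def lmo_l1_ball(grad, radius):
    """Linear minimization oracle over {v : ||v||_1 <= radius}:
    the minimizer of <grad, v> is a signed, scaled coordinate vector."""
    v = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    v[i] = -radius * np.sign(grad[i])
    return v

def taylor_fw(A, b, radius=10.0, iters=200, refresh_every=50):
    """Frank-Wolfe with a Taylor-approximated gradient (illustrative sketch).

    A: (n, p) data matrix; b: (n,) labels in {-1, +1};
    objective f(x) = (1/n) * sum_i log(1 + exp(-b_i * a_i^T x)).
    """
    n, p = A.shape
    x = np.zeros(p)
    for k in range(iters):
        if k % refresh_every == 0:
            # Anchor refresh: exact gradient and Hessian at x0.  This pass
            # costs O(n p^2) but is paid only occasionally and amortized.
            x0 = x.copy()
            s = 1.0 / (1.0 + np.exp(-b * (A @ x0)))   # sigmoid(b_i * a_i^T x0)
            g0 = A.T @ (-(1.0 - s) * b) / n           # exact gradient at x0
            H0 = (A.T * (s * (1.0 - s))) @ A / n      # exact Hessian at x0
        # Taylor-series-approximated gradient: g0 + H0 (x - x0),
        # an O(p^2) computation independent of n.
        g = g0 + H0 @ (x - x0)
        v = lmo_l1_ball(g, radius)
        x = x + (2.0 / (k + 2.0)) * (v - x)           # standard FW step size
    return x
```

The refresh frequency is the key dial in this sketch: refreshing rarely keeps most iterations free of the $n$-dependent cost, at the price of a staler quadratic model of the gradient.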