The key task of machine learning is to minimize the loss function that measures the model's fit to the training data. The numerical methods that do this efficiently depend on the properties of the loss function, the most decisive of which is its convexity or non-convexity. The fact that the loss function can have, and frequently has, non-convex regions has led to a widespread commitment to non-convex methods such as Adam. However, a local minimum implies that, in some neighborhood around it, the function is convex. In such a neighborhood, second-order minimization methods such as the Conjugate Gradient (CG) method offer guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks switch from initial non-convexity to convexity as they approach the optimum, a property we leverage to design a two-phase optimization algorithm. The algorithm detects the switch point by observing how the gradient norm depends on the loss, and applies the non-convex (Adam) and convex (CG) optimizers in the respective regions. Computational experiments confirm the hypothesis that this simple convexity structure occurs frequently enough to be exploited in practice, substantially improving convergence and accuracy.
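To make the two-phase idea concrete, the following is a minimal sketch under stated assumptions: the toy loss, the hand-rolled Adam loop, and the helper `switch_detected` (a simple monotonicity heuristic standing in for the paper's gradient-norm-versus-loss test) are all illustrative choices, not the published method; the convex phase falls back to SciPy's nonlinear CG.

```python
# Illustrative sketch of a two-phase optimizer: Adam in the (possibly)
# non-convex region, then nonlinear CG once a convexity heuristic fires.
# The switch criterion below is an assumption, not the authors' exact test.
import numpy as np
from scipy.optimize import minimize


def loss(x):
    # Toy loss: non-convex far from the origin, convex near its minimum.
    return np.sum(x ** 2) + 0.5 * np.sum(np.cos(3.0 * x))


def grad(x):
    return 2.0 * x - 1.5 * np.sin(3.0 * x)


def switch_detected(history, window=5):
    """Heuristic stand-in for the switch-point test: declare convexity once
    both the loss and the gradient norm decrease monotonically over a short
    window (assumption for illustration only)."""
    if len(history) < window:
        return False
    losses = [h[0] for h in history[-window:]]
    gnorms = [h[1] for h in history[-window:]]
    return all(b <= a for a, b in zip(losses, losses[1:])) and \
           all(b <= a for a, b in zip(gnorms, gnorms[1:]))


def two_phase_optimize(x0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8,
                       max_adam_steps=500):
    # Phase 1: Adam in the non-convex region.
    x, m, v = x0.copy(), np.zeros_like(x0), np.zeros_like(x0)
    history = []
    for t in range(1, max_adam_steps + 1):
        g = grad(x)
        history.append((loss(x), np.linalg.norm(g)))
        if switch_detected(history):
            break
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    # Phase 2: nonlinear CG in the (assumed) convex region.
    result = minimize(loss, x, jac=grad, method="CG")
    return result.x, result.fun


if __name__ == "__main__":
    x_opt, f_opt = two_phase_optimize(np.full(10, 3.0))
    print(f"final loss: {f_opt:.6f}")
```

In this sketch the only coupling between the two phases is the iterate handed from Adam to CG; any other convexity detector could be dropped into `switch_detected` without changing the surrounding loop.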