Many machine learning and data science tasks require solving non-convex optimization problems. When the loss function is a sum of multiple terms, a popular method is stochastic gradient descent. Viewed as a process for sampling the loss function landscape, stochastic gradient descent is known to prefer flat local minima. While this is desirable for certain optimization problems such as those in deep learning, it causes issues when the goal is to find the global minimum, especially if the global minimum resides in a sharp valley. Illustrated with a simple motivating example, we show that the fundamental cause is that differences in the Lipschitz constants of the terms in the loss function lead stochastic gradient descent to experience different gradient variances at different minima. To mitigate this effect and perform faithful optimization, we propose a combined resampling-reweighting scheme to balance the variance at local minima, and we extend the scheme to general loss functions. We also explain, from the perspective of stochastic asymptotics, why the proposed scheme is more likely to select the true global minimum than vanilla stochastic gradient descent. Experiments in robust statistics, computational chemistry, and neural network training demonstrate the theoretical findings.
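To make the variance-balancing idea concrete, below is a minimal numerical sketch of one common resampling-reweighting construction (importance sampling of the terms with compensating weights). The per-term gradients and Lipschitz estimates are hypothetical, and this is not necessarily the specific scheme proposed in the paper; it only illustrates how non-uniform resampling plus reweighting keeps the stochastic gradient unbiased while changing its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-term gradients of an n-term average loss at one fixed point;
# the terms have very different scales (Lipschitz constants). Values are
# illustrative only.
grads = np.array([0.3, -0.5, 25.0, -24.6])
n = len(grads)
full_grad = grads.mean()  # gradient of the averaged loss at this point

# Vanilla SGD estimator: pick one term uniformly at random.
vanilla = grads[rng.integers(n, size=200_000)]

# Resampling-reweighting estimator: sample terms with probabilities
# proportional to assumed per-term Lipschitz constants, then multiply by
# 1 / (n * p_i) so the estimator remains unbiased for the full gradient.
L = np.array([0.5, 0.5, 25.0, 25.0])  # assumed Lipschitz estimates (hypothetical)
p = L / L.sum()
idx = rng.choice(n, size=200_000, p=p)
reweighted = grads[idx] / (n * p[idx])

print("full gradient        :", full_grad)
print("vanilla    mean / var:", vanilla.mean(), vanilla.var())
print("reweighted mean / var:", reweighted.mean(), reweighted.var())
```

Both estimators match the full gradient in expectation, but their variances differ; in this particular toy setting with heterogeneous per-term scales, the reweighted estimator has noticeably smaller variance, which is the kind of effect the abstract refers to as balancing the variance.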