Averaging neural network weights sampled by a backbone stochastic gradient descent (SGD) procedure is a simple yet effective way to help the backbone SGD find optima that generalize better. From a statistical perspective, weight averaging (WA) contributes to variance reduction. The recently proposed and now well-established stochastic weight averaging (SWA) method is characterized by its use of a cyclical or high constant (CHC) learning rate schedule (LRS) to generate the weight samples for the WA operation. SWA gave rise to a new, geometric insight on WA: that WA helps discover wider optima, which in turn leads to better generalization. We conduct extensive experimental studies of SWA, involving a dozen modern DNN model structures and a dozen open-source benchmark image, graph, and text datasets. We disentangle the contributions of the WA operation and the CHC LRS to SWA, showing that the WA operation in SWA still contributes to variance reduction but does not always lead to wide optima. We show how the statistical and geometric views of SWA can be reconciled. Based on our experimental findings, we hypothesize that the DNN loss landscape contains global-scale geometric structures that an SGD agent can discover early in its run and that the WA operation can exploit. This hypothesis inspires an algorithm design termed periodic SWA (PSWA). We find that PSWA remarkably outperforms its backbone SGD during the early stage of the SGD sampling process, which we take as evidence that our hypothesis holds. Code for reproducing the experimental results is available at https://github.com/ZJLAB-AMMI/PSWA.
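The variance-reduction view of WA can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a one-dimensional quadratic loss, Gaussian gradient noise as a stand-in for minibatch noise, and a high constant learning rate. The helper names (`noisy_grad`, `sgd_samples`, `periodic_wa`, `compare`) are hypothetical, and `periodic_wa` reflects only one plausible reading of "periodic" (each period restarts SGD from the previous period's averaged weight), which the abstract does not spell out.

```python
import random


def noisy_grad(w, rng, noise_std=1.0):
    """Gradient of the toy loss 0.5 * w**2 plus Gaussian noise (assumed noise model)."""
    return w + rng.gauss(0.0, noise_std)


def sgd_samples(w0, lr, steps, rng):
    """Run constant-learning-rate SGD and return the weight after every step."""
    w, samples = w0, []
    for _ in range(steps):
        w -= lr * noisy_grad(w, rng)
        samples.append(w)
    return samples


def average(samples):
    """Plain arithmetic mean of the sampled weights (the WA operation)."""
    return sum(samples) / len(samples)


def periodic_wa(w0, lr, steps_per_period, periods, rng):
    """Assumed periodic scheme: average each period, restart SGD from that average."""
    w = w0
    for _ in range(periods):
        w = average(sgd_samples(w, lr, steps_per_period, rng))
    return w


def mean_sq_error(values):
    """Mean squared distance to the optimum, which sits at w = 0 for this loss."""
    return sum(v * v for v in values) / len(values)


def compare(trials=200, lr=0.5, steps=50, seed=0):
    """Compare the last SGD iterate with the average of the second half of the run."""
    rng = random.Random(seed)
    last, avg = [], []
    for _ in range(trials):
        s = sgd_samples(5.0, lr, steps, rng)
        last.append(s[-1])
        avg.append(average(s[steps // 2:]))
    return mean_sq_error(last), mean_sq_error(avg)


if __name__ == "__main__":
    last_mse, avg_mse = compare()
    print("last iterate MSE:", last_mse)
    print("averaged weight MSE:", avg_mse)
```

On this toy loss the averaged weight should land closer to the optimum than the last iterate, because averaging cancels part of the noise that a high constant learning rate keeps injecting. This says nothing about the width of the optimum, which is a separate, geometric question.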