A common method in training neural networks is to initialize all the weights to be independent Gaussian vectors. We observe that by instead initializing the weights into independent pairs, where each pair consists of two identical Gaussian vectors, we can significantly improve the convergence analysis. While a similar technique has been studied for random inputs [Daniely, NeurIPS 2020], it has not been analyzed with arbitrary inputs. Using this technique, we show how to significantly reduce the number of neurons required for two-layer ReLU networks, both in the under-parameterized setting with logistic loss, from roughly $\gamma^{-8}$ [Ji and Telgarsky, ICLR 2020] to $\gamma^{-2}$, where $\gamma$ denotes the separation margin with a Neural Tangent Kernel, and in the over-parameterized setting with squared loss, from roughly $n^4$ [Song and Yang, 2019] to $n^2$, implicitly also improving the recent running-time bound of [Brand, Peng, Song and Weinstein, ITCS 2021]. For the under-parameterized setting we also prove new lower bounds that improve upon prior work and, under certain assumptions, are best possible.
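To make the paired initialization concrete, the following is a minimal NumPy sketch of a two-layer ReLU network whose first-layer weights are drawn as $m/2$ independent Gaussian vectors, each duplicated once, rather than $m$ fully independent vectors. The opposite-sign second-layer weights (which make the network output zero at initialization) are an illustrative assumption, not a detail stated in the abstract.

```python
import numpy as np

def paired_gaussian_init(m, d, seed=None):
    """Sketch of the paired initialization: draw m/2 independent Gaussian
    vectors and duplicate each, so the m first-layer weight vectors form
    m/2 identical pairs (standard practice draws all m independently)."""
    rng = np.random.default_rng(seed)
    assert m % 2 == 0, "the number of neurons m is assumed to be even"
    half = rng.standard_normal((m // 2, d))  # m/2 independent Gaussian vectors
    return np.repeat(half, 2, axis=0)        # each vector appears exactly twice

def two_layer_relu(x, W, a):
    """Two-layer ReLU network f(x) = sum_r a_r * relu(<w_r, x>)."""
    return a @ np.maximum(W @ x, 0.0)

# Usage sketch with hypothetical sizes: m = 4 neurons, input dimension d = 3.
W = paired_gaussian_init(m=4, d=3, seed=0)
a = np.tile([1.0, -1.0], 2) / np.sqrt(4)  # opposite signs within each pair (an
                                          # assumption) so that f(x) = 0 at init
x = np.ones(3)
print(two_layer_relu(x, W, a))            # prints 0.0 at initialization
```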