Do neural networks generalise because of bias in the functions returned by gradient descent, or bias already present in the network architecture? Por qu\'e no los dos? This paper finds that while typical networks that fit the training data already generalise fairly well, gradient descent can further improve generalisation by selecting networks with a large margin. This conclusion is based on a careful study of the behaviour of infinite width networks trained by Bayesian inference and finite width networks trained by gradient descent. To measure the implicit bias of architecture, new technical tools are developed to both analytically bound and consistently estimate the average test error of the neural network--Gaussian process (NNGP) posterior. This error is found to be already better than chance, corroborating the findings of Valle-P\'erez et al. (2019) and underscoring the importance of architecture. Going beyond this result, this paper finds that test performance can be substantially improved by selecting a function with much larger margin than is typical under the NNGP posterior. This highlights a curious fact: minimum a posteriori functions can generalise best, and gradient descent can select for those functions. In summary, new technical tools suggest a nuanced portrait of generalisation involving both the implicit biases of architecture and gradient descent. Code for this paper is available at: https://github.com/jxbz/implicit-bias/.