We study the theory of neural networks (NNs) through the lens of classical nonparametric regression, focusing on NNs' ability to adaptively estimate functions with heterogeneous smoothness, a property of functions in Besov or Bounded Variation (BV) classes. Existing work on this problem requires tuning the NN architecture to the function space and the sample size. We consider a "Parallel NN" variant of deep ReLU networks and show that standard weight decay is equivalent to promoting $\ell_p$-sparsity ($0<p<1$) of the coefficient vector of an end-to-end learned function basis, i.e., a dictionary. Using this equivalence, we further establish that by tuning only the weight decay, such a Parallel NN achieves an estimation error arbitrarily close to the minimax rate for both the Besov and BV classes. Notably, it gets exponentially closer to minimax optimal as the NN gets deeper. Our research sheds new light on why depth matters and how NNs are more powerful than kernel methods.
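As a minimal illustrative sketch of the claimed equivalence (using hypothetical notation $a_j$, $w_{j,\ell}$, $L$ rather than the paper's own construction): suppose the $j$-th parallel subnetwork of depth $L$ outputs a fixed basis function scaled by a coefficient $a_j = \prod_{\ell=1}^{L} w_{j,\ell}$, the product of per-layer weight magnitudes $w_{j,\ell} > 0$. Weight decay penalizes $\sum_{\ell} w_{j,\ell}^2$, and since the layers can be rescaled without changing the realized function, the effective penalty induced on $a_j$ is
\[
\min_{\prod_{\ell} w_{j,\ell} = |a_j|} \; \sum_{\ell=1}^{L} w_{j,\ell}^2 \;=\; L\,|a_j|^{2/L},
\]
attained when all $w_{j,\ell} = |a_j|^{1/L}$ (by the AM--GM inequality). Summed over subnetworks, weight decay thus acts like an $\ell_p$ quasi-norm penalty $\sum_j |a_j|^p$ with $p = 2/L \in (0,1)$ for $L > 2$, which promotes sparsity of the learned dictionary coefficients and becomes a stronger sparsity surrogate as the network gets deeper.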