This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically. We define the "staircase" property for functions over the Boolean hypercube, which posits that high-order Fourier coefficients are reachable from lower-order Fourier coefficients along increasing chains. We prove that functions satisfying this property can be learned in polynomial time using layerwise stochastic coordinate descent on regular neural networks -- a class of network architectures and initializations that have homogeneity properties. Our analysis shows that for such staircase functions and neural networks, the gradient-based algorithm learns high-level features by greedily combining lower-level features along the depth of the network. We further back our theoretical results with experiments showing that staircase functions are also learnable by more standard ResNet architectures with stochastic gradient descent. Both the theoretical and experimental results support the hypothesis that staircase properties have a role to play in understanding the capabilities of gradient-based learning on regular networks, in contrast to general polynomial-size networks, which can emulate any SQ or PAC algorithm, as recently shown.
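As a concrete illustration (our assumed example, not spelled out in this abstract), a degree-$d$ function over the Boolean hypercube $\{-1,+1\}^d$ whose nonzero Fourier coefficients form an increasing chain of monomials is
\[
f(x) = x_1 + x_1 x_2 + x_1 x_2 x_3 + \dots + x_1 x_2 \cdots x_d,
\]
where each monomial extends the previous one by a single coordinate, so each high-order coefficient is reachable from a lower-order one in the sense of the staircase property.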