How does a 110-layer ResNet learn a high-complexity classifier from relatively few training examples and with a short training time? We present a theory that explains this in terms of Hierarchical Learning. By hierarchical learning we mean that the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions, reducing sample and time complexity. We formally analyze how multi-layer neural networks can perform such hierarchical learning efficiently and automatically by applying SGD. On the conceptual side, we present, to the best of our knowledge, the FIRST theory result indicating how deep neural networks can be sample and time efficient using SGD on certain hierarchical learning tasks, when NO KNOWN existing algorithm is efficient. We establish a new principle called "backward feature correction", in which training the higher-level layers of a network improves the features of the lower-level ones. We believe this principle is key to understanding the deep learning process in multi-layer neural networks. On the technical side, we show that for regression, and even for binary classification, for every input dimension $d>0$ there is a concept class of degree-$\omega(1)$ polynomials such that, using $\omega(1)$-layer neural networks as learners, SGD can learn any function from this class in $\mathsf{poly}(d)$ time and sample complexity to any $\frac{1}{\mathsf{poly}(d)}$ error, by learning to represent it as a composition of $\omega(1)$ layers of quadratic functions. In contrast, we know of no other simple algorithm (including layer-wise training or applying kernel methods sequentially) that can learn this concept class in $\mathsf{poly}(d)$ time, even to $d^{-0.01}$ error. As a side result, we prove $d^{\omega(1)}$ lower bounds for several non-hierarchical learners, including any kernel method and the neural tangent and neural compositional kernels.
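To make the setup concrete, here is a minimal sketch (not the paper's construction) of the kind of learning problem the abstract describes: a target function built as a composition of $L$ layers of random quadratic functions, so the end-to-end target is a degree-$2^L$ polynomial, paired with a depth-$L$ learner using quadratic activations trained end-to-end by SGD. All names, dimensions, and hyperparameters below are illustrative assumptions, not values from the paper.

```python
# A hedged, illustrative sketch: hierarchical target = composition of quadratics,
# learner = deep network with quadratic activations trained by plain SGD.
import numpy as np
import torch
import torch.nn as nn

d, L, width = 32, 4, 64           # input dim, depth, learner width (all assumed)
rng = np.random.default_rng(0)

# --- Hierarchical target F = q_L ∘ ... ∘ q_1, each q_k a bank of random
# quadratic forms, so F is a degree-2^L polynomial of the input. ---
m = 8                             # internal width of the target (assumed)
P_in = rng.standard_normal((m, d)) / np.sqrt(d)   # project input into R^m
A = [rng.standard_normal((m, m, m)) / m for _ in range(L)]

def target(x):                    # x: (n, d) -> (n,)
    h = x @ P_in.T
    for Ak in A:
        # unit k of this layer computes the quadratic form h^T A[k] h
        h = np.einsum('ni,kij,nj->nk', h, Ak, h)
        h = h / (1e-8 + np.linalg.norm(h, axis=1, keepdims=True))  # keep scale tame
    return h[:, 0]

# --- Learner: L layers with quadratic activations, trained end-to-end. ---
class QuadNet(nn.Module):
    def __init__(self):
        super().__init__()
        dims = [d] + [width] * L
        self.layers = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims, dims[1:]))
        self.head = nn.Linear(width, 1)

    def forward(self, x):
        for lin in self.layers:
            x = lin(x) ** 2       # quadratic activation: a composition of quadratics
        return self.head(x).squeeze(-1)

X = torch.tensor(rng.standard_normal((4096, d)), dtype=torch.float32)
y = torch.tensor(target(X.numpy()), dtype=torch.float32)

net = QuadNet()
opt = torch.optim.SGD(net.parameters(), lr=1e-3)
for step in range(2000):
    idx = torch.randint(0, len(X), (128,))
    loss = ((net(X[idx]) - y[idx]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  mse {loss.item():.4f}")
```

Because all layers are trained jointly, gradient signal from the higher layers can reshape the lower-layer features, which is the intuition behind "backward feature correction" in the abstract; a kernel method or a layer-wise learner would instead have to fit the degree-$2^L$ end-to-end polynomial directly, which is what the stated lower bounds rule out in $\mathsf{poly}(d)$ time.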