We study the loss surface of DNNs with $L_{2}$ regularization. We show that the loss, as a function of the parameters, can be reformulated as a loss in terms of the layerwise activations $Z_{\ell}$ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation $Z_{\ell}$ is optimal w.r.t. an attraction/repulsion problem and interpolates between the input and output representations, keeping as little information from the input as is necessary to construct the activations of the next layer. For positively homogeneous non-linearities, the loss can be further reformulated in terms of the covariances of the hidden representations, which takes the form of a partially convex optimization over a convex cone. This second reformulation allows us to prove a sparsity result for homogeneous DNNs: any local minimum of the $L_{2}$-regularized loss can be achieved with at most $N(N+1)$ neurons in each hidden layer (where $N$ is the size of the training set). We show that this bound is tight by giving an example of a local minimum that requires $N^{2}/4$ hidden neurons. We also observe numerically that, in more traditional settings, far fewer than $N^{2}$ neurons are required to reach the minima.
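For concreteness, here is a minimal sketch of the objective under discussion, with notation assumed rather than taken verbatim from the paper: an $L$-layer network acting on the $N$ training inputs $X$, with weight matrices $W_{\ell}$, a non-linearity $\sigma$, hidden representations $Z_{\ell}$, a cost $C$ comparing the network outputs to the labels $Y$, and regularization strength $\lambda$:
\[
\mathcal{L}_{\lambda}(W_{1},\dots,W_{L})
= C\bigl(W_{L} Z_{L-1},\, Y\bigr) + \lambda \sum_{\ell=1}^{L} \lVert W_{\ell} \rVert_{F}^{2},
\qquad
Z_{\ell} = \sigma\bigl(W_{\ell} Z_{\ell-1}\bigr),\quad Z_{0} = X.
\]
The reformulation described above trades the weights $W_{\ell}$ for the representations $Z_{\ell}$ as the optimization variables, which is what makes the attraction/repulsion structure and the per-layer neuron bound visible.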