We study how neural networks compress uninformative input space in models where the data lie in $d$ dimensions, but whose labels vary only within a linear manifold of dimension $d_\parallel < d$. We show that for a one-hidden-layer network initialized with infinitesimal weights (i.e. in the feature learning regime) and trained with gradient descent, the first-layer weights evolve to become nearly insensitive to the $d_\perp=d-d_\parallel$ uninformative directions. These directions are effectively compressed by a factor $\lambda\sim \sqrt{p}$, where $p$ is the size of the training set. We quantify the benefit of this compression on the test error $\epsilon$. For large initialization of the weights (the lazy training regime), no compression occurs, and for regular boundaries separating labels we find that $\epsilon \sim p^{-\beta}$, with $\beta_\text{Lazy} = d / (3d-2)$. Compression improves the learning curves, so that $\beta_\text{Feature} = (2d-1)/(3d-2)$ if $d_\parallel = 1$ and $\beta_\text{Feature} = (d + d_\perp/2)/(3d-2)$ if $d_\parallel > 1$. We test these predictions for a stripe model, where the boundaries are parallel interfaces ($d_\parallel=1$), as well as for a cylindrical boundary ($d_\parallel=2$). Next, we show that compression shapes the evolution of the Neural Tangent Kernel (NTK) in time, so that its top eigenvectors become more informative and display a larger projection on the labels. Consequently, kernel learning with the frozen NTK at the end of training outperforms kernel learning with the initial NTK. We confirm these predictions both for a one-hidden-layer fully connected (FC) network trained on the stripe model and for a 16-layer CNN trained on MNIST, for which we also find $\beta_\text{Feature}>\beta_\text{Lazy}$.
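To make the predicted exponents concrete, the short sketch below evaluates $\beta_\text{Lazy}$ and $\beta_\text{Feature}$ from the formulas stated above for a given $d$ and $d_\parallel$. The function names `beta_lazy` and `beta_feature` are hypothetical helpers introduced here for illustration only. The code simply transcribes the two closed-form expressions and is not a simulation of either training regime.

```python
from fractions import Fraction


def beta_lazy(d: int) -> Fraction:
    """Lazy-regime exponent beta = d / (3d - 2), as stated in the abstract."""
    if d < 2:
        raise ValueError("need d >= 2 so that 1 <= d_parallel < d is possible")
    return Fraction(d, 3 * d - 2)


def beta_feature(d: int, d_par: int) -> Fraction:
    """Feature-regime exponent, as stated in the abstract.

    (2d - 1) / (3d - 2)          if d_parallel == 1
    (d + d_perp / 2) / (3d - 2)  if d_parallel > 1, with d_perp = d - d_parallel
    """
    if not 1 <= d_par < d:
        raise ValueError("need 1 <= d_parallel < d")
    d_perp = d - d_par
    if d_par == 1:
        return Fraction(2 * d - 1, 3 * d - 2)
    # (d + d_perp / 2) / (3d - 2), written with integers only
    return Fraction(2 * d + d_perp, 2 * (3 * d - 2))


if __name__ == "__main__":
    # Illustrative dimensions only; these values are not taken from the paper.
    for d, d_par in [(3, 1), (3, 2), (10, 1), (10, 2)]:
        print(d, d_par, beta_lazy(d), beta_feature(d, d_par))
```

For instance, with $d=3$ the formulas give $\beta_\text{Lazy}=3/7$, while $\beta_\text{Feature}=5/7$ for $d_\parallel=1$ and $\beta_\text{Feature}=1/2$ for $d_\parallel=2$, so the feature-regime exponent is larger in both cases.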