In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate, refuting the conjecture of Malach and Shalev-Shwartz that 'deeper is better only when shallow is good'. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features.
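To make the setting concrete, the following is a minimal, illustrative sketch of the kind of training problem described above: XOR-like binary labels that no linear classifier can fit better than chance, a constant fraction of flipped labels, and a two-layer ReLU network trained by full-batch gradient descent on the logistic loss. All constants (dimension `d`, sample size `n`, width `m`, step size `eta`, noise rate) are placeholder choices, not values from the paper, and the adversarial corruption is modeled here simply as random label flips.

```python
# Illustrative sketch only; not the paper's exact distribution, initialization, or analysis.
import numpy as np

rng = np.random.default_rng(0)
d, n, m, eta, steps, noise_rate = 20, 400, 200, 0.1, 500, 0.05

# XOR-like labels: y = sign(x_1 * x_2), so no linear classifier beats random guessing.
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] * X[:, 1])
flip = rng.random(n) < noise_rate      # a constant fraction of labels corrupted (random flips here)
y[flip] *= -1

# Two-layer ReLU network f(x) = sum_j a_j * relu(<w_j, x>) with random initialization;
# the second layer is held fixed, a common simplification in this line of work.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def forward(X):
    pre = X @ W.T                      # (n, m) pre-activations
    return np.maximum(pre, 0) @ a, pre

for _ in range(steps):
    out, pre = forward(X)
    # Logistic loss log(1 + exp(-y f(x))): gradient w.r.t. the output is -y / (1 + exp(y f(x))).
    g_out = -y / (1.0 + np.exp(y * out))
    grad_W = ((pre > 0) * (g_out[:, None] * a)).T @ X / n
    W -= eta * grad_W

preds = np.sign(forward(X)[0])
print("training error:", np.mean(preds != y))
```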