Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes the classification margin, which is equivalent to minimizing the norm of the weight matrices under margin constraints. This property of the solution, however, does not fully characterize the generalization performance. We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization. We then show that, after data separation is achieved, it is possible to dynamically reduce the training set by more than 99% without significant loss of performance. Interestingly, the resulting subset of "high capacity" features is not consistent across different training runs, which is consistent with the theoretical claim that all training points should converge to the same asymptotic margin under SGD in the presence of both batch normalization and weight decay.
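As a sketch of how the first statement is usually formalized (assuming a homogeneous network f(x; W_1, ..., W_L) trained past the point of data separation; the precise setting and normalization are given in the paper, not in the abstract), gradient descent under exponential-type losses converges in direction to a solution of the constrained problem

\[
\min_{W_1,\dots,W_L}\ \sum_{k=1}^{L}\|W_k\|_F^2
\qquad \text{s.t.}\qquad y_n\, f(x_n; W_1,\dots,W_L)\ \ge\ 1 \quad \text{for every training point } (x_n, y_n),
\]

which, for homogeneous networks, is equivalent to maximizing the normalized margin \(\min_n y_n f(x_n)\,/\,\prod_k \|W_k\|\).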
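The margin-distribution AUC and the dynamic reduction of the training set can be illustrated with a short sketch. The abstract does not specify the margin normalization or the rule for selecting the retained subset, so the multi-class margin definition, the normalization, and the keep-the-smallest-margins rule below are assumptions for illustration; margin_distribution_auc and reduce_training_set are hypothetical helper names.

import numpy as np

def margin_distribution_auc(scores, labels):
    """Margins and area under the sorted-margin curve on the training set.

    scores: (N, C) array of network outputs f(x_n); labels: (N,) int array.
    The margin of example n is the true-class score minus the largest
    other-class score (a common multi-class definition, assumed here).
    """
    n = scores.shape[0]
    true_score = scores[np.arange(n), labels]
    others = scores.copy()
    others[np.arange(n), labels] = -np.inf
    margins = true_score - others.max(axis=1)
    # Scale-invariant normalization (assumed; the paper may instead
    # normalize by the product of the layer norms).
    normalized = margins / (np.abs(margins).max() + 1e-12)
    # Area under the sorted margin-vs-rank curve: with unit-width bins
    # this reduces to the mean of the normalized margins.
    return normalized.mean(), normalized

def reduce_training_set(margins, keep_fraction=0.01):
    """After data separation (all margins > 0), keep only a small subset;
    keeping the smallest-margin examples is an illustrative assumption."""
    if (margins <= 0).any():
        return np.arange(margins.shape[0])   # not yet separated: keep everything
    k = max(1, int(keep_fraction * margins.shape[0]))
    return np.argsort(margins)[:k]

# Toy usage with random scores standing in for a trained network's outputs.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
scores = rng.normal(size=(1000, 10))
scores[np.arange(1000), labels] += 6.0        # make the toy data separable
auc, margins = margin_distribution_auc(scores, labels)
kept = reduce_training_set(margins, keep_fraction=0.01)
print(f"margin-AUC = {auc:.3f}; kept {kept.size} of {margins.size} examples")

Under these assumptions, a keep_fraction of 0.01 corresponds to the ">99%" reduction mentioned above; which examples end up in the retained subset depends on the run, consistent with the observation that the "high capacity" subset is not stable across training runs.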