Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples. Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7x faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%. In addition, Normalizer-Free models attain significantly better performance than their batch-normalized counterparts when finetuning on ImageNet after large-scale pre-training on a dataset of 300 million labeled images, with our best models obtaining an accuracy of 89.2%. Our code is available at https://github.com/deepmind/deepmind-research/tree/master/nfnets.
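The abstract names adaptive gradient clipping (AGC) as the technique that stabilizes training. As a rough illustration of the idea, the sketch below rescales a parameter's gradient whenever its norm grows too large relative to the norm of the parameter itself. This is a minimal NumPy sketch only: the function names, the per-output-row definition of a "unit", and the default threshold are illustrative assumptions, not the released implementation (which is available at the repository linked above).

```python
# Minimal sketch of adaptive gradient clipping (AGC): clip a gradient unit-wise
# when its norm exceeds a fraction `clipping` of the corresponding parameter norm.
# Names and defaults here are illustrative assumptions, not the reference code.
import numpy as np

def unitwise_norm(x: np.ndarray) -> np.ndarray:
    """Norm taken over all axes except the first, one value per output unit."""
    if x.ndim <= 1:
        return np.abs(x)
    axes = tuple(range(1, x.ndim))
    return np.sqrt(np.sum(x ** 2, axis=axes, keepdims=True))

def adaptive_grad_clip(grad: np.ndarray, param: np.ndarray,
                       clipping: float = 0.01, eps: float = 1e-3) -> np.ndarray:
    """Rescale units of `grad` whose norm exceeds clipping * max(||param||, eps)."""
    p_norm = np.maximum(unitwise_norm(param), eps)
    g_norm = unitwise_norm(grad)
    max_norm = clipping * p_norm
    # Only rescale the units whose gradient norm is too large; leave others untouched.
    scale = np.where(g_norm > max_norm, max_norm / np.maximum(g_norm, 1e-6), 1.0)
    return grad * scale

# Usage example: clip an oversized gradient for a 64x128 weight matrix.
w = np.random.randn(64, 128).astype(np.float32)
g = 10.0 * np.random.randn(64, 128).astype(np.float32)
g_clipped = adaptive_grad_clip(g, w)
```

The intuition is that the acceptable gradient magnitude is tied to the scale of the weights it updates, so the threshold adapts per unit rather than being a single global constant.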