Large-batch training has become a commonly used technique for training neural networks on large numbers of GPU/TPU processors. As batch size increases, stochastic optimizers tend to converge to sharp local minima, leading to degraded test performance. Current methods usually rely on extensive data augmentation to increase the batch size, but we find that the performance gain from data augmentation diminishes as the batch size grows, and that augmentation becomes insufficient beyond a certain point. In this paper, we propose to use adversarial learning to increase the batch size in large-batch training. Although adversarial learning is a natural choice for smoothing the decision surface and biasing toward a flat region, it has not been successfully applied to large-batch training because it requires at least two sequential gradient computations at each step, which at least doubles the running time compared with vanilla training, even with a large number of processors. To overcome this issue, we propose a novel Concurrent Adversarial Learning (ConAdv) method that decouples the sequential gradient computations in adversarial learning by using stale parameters. Experimental results demonstrate that ConAdv can successfully increase the batch size for both ResNet-50 and EfficientNet training on ImageNet while maintaining high accuracy. In particular, we show that ConAdv alone achieves 75.3\% top-1 accuracy on ImageNet ResNet-50 training with a 96K batch size, and that the accuracy can be further improved to 76.2\% when ConAdv is combined with data augmentation. This is the first work that successfully scales the ResNet-50 training batch size to 96K.
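As a minimal illustration of the decoupling idea, the sketch below generates each step's adversarial examples from a stale weight snapshot that is one update behind, so that in a pipelined or distributed run this gradient pass can overlap with the previous weight update rather than follow it sequentially. This is a sketch under assumptions, not the authors' implementation: the one-step FGSM-style perturbation, the `fgsm_perturb`/`conadv_step` names, and the `epsilon` value are all illustrative.

```python
# Minimal sketch of the ConAdv idea (illustrative, not the paper's code).
# Vanilla adversarial training needs two sequential gradient passes per step:
# one to build the perturbation, one to update the weights. The idea sketched
# here builds the perturbation for step t from the stale weights of step t-1,
# so in a distributed setting the two passes can run concurrently.
import copy
import torch
import torch.nn as nn

def fgsm_perturb(model, x, y, epsilon, loss_fn):
    """One-step (FGSM-style) adversarial perturbation of inputs x."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + epsilon * grad.sign()).detach()

def conadv_step(model, stale_model, opt, x, y, epsilon, loss_fn):
    # Perturbations come from the stale snapshot (one update behind the live
    # weights); this removes the dependency on the current update finishing,
    # which is what allows overlap in a distributed run. Here the two passes
    # run sequentially for clarity.
    x_adv = fgsm_perturb(stale_model, x, y, epsilon, loss_fn)
    opt.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    # Snapshot the current weights *before* stepping, so the next step's
    # perturbation is exactly one update stale.
    stale_model.load_state_dict(model.state_dict())
    opt.step()
    return loss.item()

if __name__ == "__main__":
    model = nn.Linear(10, 2)              # stand-in for ResNet-50/EfficientNet
    stale = copy.deepcopy(model)          # lagging weight snapshot
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(96, 10), torch.randint(0, 2, (96,))
    print(conadv_step(model, stale, opt, x, y, 0.05, nn.CrossEntropyLoss()))
```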