Large-batch training has become a commonly used technique for training neural networks on a large number of GPU/TPU processors. As batch size increases, stochastic optimizers tend to converge to sharp local minima, leading to degraded test performance. Current methods usually rely on extensive data augmentation to increase the batch size, but we found that the performance gain from data augmentation diminishes as batch size grows, and that augmentation becomes insufficient beyond a certain point. In this paper, we propose to use adversarial learning to increase the batch size in large-batch training. Despite being a natural choice for smoothing the decision surface and biasing towards a flat region, adversarial learning has not been successfully applied to large-batch training, since it requires at least two sequential gradient computations at each step, which at least doubles the running time compared with vanilla training even with a large number of processors. To overcome this issue, we propose a novel Concurrent Adversarial Learning (ConAdv) method that decouples the sequential gradient computations in adversarial learning by utilizing stale parameters. Experimental results demonstrate that ConAdv can successfully increase the batch size of ResNet-50 training on ImageNet while maintaining high accuracy. In particular, we show that ConAdv alone achieves 75.3\% top-1 accuracy for ImageNet ResNet-50 training with a 96K batch size, and that the accuracy can be further improved to 76.2\% when combining ConAdv with data augmentation. This is the first work to successfully scale the ResNet-50 training batch size to 96K.
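To make the decoupling concrete, here is a minimal PyTorch sketch, assuming single-step FGSM-style perturbations; the names `fgsm_perturb`, `conadv_step`, `stale_model`, and `eps` are illustrative rather than taken from the paper, and the concurrency itself is elided: in the distributed setting the abstract describes, the stale-weight perturbation for the next batch would be computed in parallel with the current weight update, rather than serially as written below.

```python
import torch
import torch.nn.functional as F


def fgsm_perturb(model, x, y, eps):
    """Craft a single-step (FGSM-style) adversarial example.

    In the ConAdv setting, `model` holds *stale* parameters, so this
    gradient does not depend on the most recent weight update.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    # Gradient w.r.t. the input only; parameter grads are left untouched.
    (grad_x,) = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad_x.sign()).detach()


def conadv_step(model, stale_model, opt, x, y, eps=0.01):
    """One decoupled training step.

    Because the perturbation depends only on `stale_model`, it can be
    computed concurrently with the previous step's weight update,
    avoiding the two sequential gradient computations per step.
    """
    x_adv = fgsm_perturb(stale_model, x, y, eps)   # stale weights
    opt.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)        # current weights
    loss.backward()
    opt.step()
    return loss.item()
```

In a serial emulation of this scheme, `stale_model` could simply be refreshed from `model` once per iteration (e.g., a parameter snapshot taken before `opt.step()`), so that the adversarial examples for batch t are always generated from the weights of step t-1.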