Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.
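As a rough illustration of the two techniques named in the abstract, the linear scaling rule and gradual warmup, the following Python sketch computes a per-iteration learning rate. The helper names and constants (a reference learning rate of 0.1 at minibatch size 256, a 2500-iteration warmup) are assumptions for illustration, not values taken from this abstract.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule (sketch): multiply the learning rate by k
    when the minibatch size is multiplied by k."""
    return base_lr * batch / base_batch


def lr_at_iteration(it, warmup_iters, start_lr, target_lr):
    """Gradual warmup (sketch): ramp the learning rate linearly from
    start_lr to target_lr over the first warmup_iters iterations,
    then hold it at target_lr. Schedule details are assumptions."""
    if it < warmup_iters:
        return start_lr + (target_lr - start_lr) * it / warmup_iters
    return target_lr


# Example: scale an assumed reference lr of 0.1 (minibatch 256)
# to a minibatch of 8192, warming up over the first 2500 iterations.
target = scaled_lr(base_lr=0.1, base_batch=256, batch=8192)  # -> 3.2
for it in (0, 1000, 2500, 5000):
    print(it, lr_at_iteration(it, warmup_iters=2500, start_lr=0.1, target_lr=target))
```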