Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count, so have the compute budgets and times required to train them, increasing the need for compute-efficient methods that parallelize training. Two common approaches to parallelizing the training of deep networks are data and model parallelism. While useful, data and model parallelism suffer from diminishing returns in terms of compute efficiency for large batch sizes. In this paper, we investigate how to continue scaling compute efficiently beyond the point of diminishing returns for large batches through local parallelism, a framework that parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. Local parallelism enables fully asynchronous layer-wise parallelism with a low memory footprint, and requires little communication overhead compared with model parallelism. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
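The core mechanism described above, replacing end-to-end backpropagation with per-block local losses whose gradients are truncated at block boundaries, can be sketched in a few lines of PyTorch. This is a minimal illustration under our own assumptions (a toy MLP, a linear auxiliary head per block, and hypothetical names such as `LocalBlock` and `local_train_step`), not the paper's implementation; the detach boundaries are what would allow each block to be updated asynchronously on its own device, although the sketch runs the blocks sequentially for clarity.

```python
# Minimal sketch of local (truncated layer-wise) training, assuming a small MLP
# split into blocks. Each block has its own auxiliary loss and optimizer, and
# activations are detached at block boundaries so no gradient crosses blocks.
# All names here (LocalBlock, local_train_step, ...) are illustrative.

import torch
import torch.nn as nn

class LocalBlock(nn.Module):
    """One locally trained block: a feature layer plus an auxiliary head."""
    def __init__(self, in_dim, out_dim, num_classes):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        self.aux_head = nn.Linear(out_dim, num_classes)  # local loss head

    def forward(self, x):
        h = self.body(x)
        return h, self.aux_head(h)

def local_train_step(blocks, optimizers, x, y, criterion):
    """One step of training with truncated, block-local backpropagation."""
    h = x
    for block, opt in zip(blocks, optimizers):
        h, logits = block(h)
        loss = criterion(logits, y)   # auxiliary loss for this block only
        opt.zero_grad()
        loss.backward()               # gradients stay inside the block
        opt.step()
        h = h.detach()                # stop gradient: truncate backprop here
    return loss.item()                # loss of the last block

if __name__ == "__main__":
    torch.manual_seed(0)
    blocks = nn.ModuleList([
        LocalBlock(32, 64, 10),
        LocalBlock(64, 64, 10),
        LocalBlock(64, 64, 10),
    ])
    optimizers = [torch.optim.SGD(b.parameters(), lr=0.1) for b in blocks]
    criterion = nn.CrossEntropyLoss()
    x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
    for step in range(5):
        print(step, local_train_step(blocks, optimizers, x, y, criterion))
```

Because no activation or gradient needs to be stored across block boundaries, each block only keeps its own computation graph in memory, which is the source of the low memory footprint and low communication overhead claimed in the abstract.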