Motivated by extreme multi-label classification applications, we consider training deep learning models over sparse data on multi-GPU servers. The variance in the number of non-zero features across training batches and the intrinsic GPU heterogeneity combine to limit accuracy and increase the time to convergence. We address these challenges with Adaptive SGD, an adaptive elastic model averaging stochastic gradient descent algorithm for heterogeneous multi-GPUs that is characterized by dynamic scheduling, adaptive batch size scaling, and normalized model merging. Instead of statically partitioning batches to GPUs, batches are routed based on the relative processing speed of each GPU. Batch size scaling assigns larger batches to the faster GPUs and smaller batches to the slower ones, with the goal of arriving at a steady state in which all the GPUs perform the same number of model updates. Normalized model merging computes optimal weights for every GPU based on the assigned batches such that the combined model achieves better accuracy. We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy and is scalable with the number of GPUs.
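To make the normalized model merging idea concrete, the following is a minimal illustrative sketch in NumPy. It assumes, purely for illustration, that each GPU's merge weight is proportional to the number of examples it processed; the actual optimal weights derived in the paper may differ, and the function name `merge_models` is hypothetical.

```python
import numpy as np

def merge_models(local_models, examples_processed):
    """Combine per-GPU models into one by a normalized weighted average.

    local_models: list of 1-D parameter vectors, one per GPU.
    examples_processed: number of training examples each GPU consumed
    (a stand-in for the per-GPU weighting the paper actually computes).
    """
    counts = np.asarray(examples_processed, dtype=float)
    weights = counts / counts.sum()    # normalize so the weights sum to 1
    stacked = np.stack(local_models)   # shape: (num_gpus, num_params)
    return weights @ stacked           # weighted average of the parameters

# Example: a faster GPU that processed 3x the data contributes 3x the weight.
merged = merge_models(
    [np.array([1.0, 2.0]), np.array([3.0, 6.0])],
    examples_processed=[300, 100],
)
print(merged)  # -> [1.5 3. ]
```

Under this weighting, GPUs that were assigned larger batches (the faster ones, per the batch size scaling above) pull the merged model more strongly toward their local parameters.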