Gradient Balancing (GraB) is a recently proposed technique that finds provably better data permutations when training models with multiple epochs over a finite dataset. It converges at a faster rate than the widely adopted Random Reshuffling by minimizing the discrepancy of the gradients on adjacently selected examples. However, GraB only operates under restrictive assumptions such as small batch sizes and centralized data, leaving open the question of how to order examples at large scale -- i.e., distributed learning with decentralized data. To alleviate this limitation, in this paper we propose D-GraB, which involves two novel designs: (1) $\textsf{PairBalance}$, which eliminates GraB's requirement of a stale gradient mean, a requirement that critically relies on small learning rates; (2) an ordering protocol that runs $\textsf{PairBalance}$ in a distributed environment with negligible overhead, benefiting from both data ordering and parallelism. We prove that D-GraB enjoys a linear speedup at rate $\tilde{O}((mnT)^{-2/3})$ on smooth non-convex objectives and $\tilde{O}((mnT)^{-2})$ under the PL condition, where $n$ denotes the number of parallel workers, $m$ denotes the number of examples per worker, and $T$ denotes the number of epochs. Empirically, we show on various applications including GLUE, CIFAR10 and WikiText-2 that D-GraB outperforms naive parallel GraB and Distributed Random Reshuffling in terms of both training and validation performance.
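To make the pair-balancing idea concrete, here is a minimal sketch of how an ordering could be produced from per-example gradients by balancing the *difference* of each adjacent pair, so that no stale gradient mean is needed. This is an illustrative sketch under assumed conventions (the greedy sign rule and the front/back placement are common choices in herding-style balancing), not the authors' exact $\textsf{PairBalance}$ implementation.

```python
import numpy as np

def pair_balance_order(grads):
    """Illustrative pair-balance ordering (assumed conventions, not the
    paper's exact algorithm). For each adjacent pair of examples, greedily
    pick a sign for the gradient difference that keeps the signed running
    sum small; the sign decides which example of the pair goes to the
    front of the new order and which goes to the back."""
    n = len(grads)
    assert n % 2 == 0, "this sketch assumes an even number of examples"
    run = np.zeros_like(grads[0])
    front, back = [], []
    for k in range(0, n, 2):
        d = grads[k] - grads[k + 1]
        # Greedy herding-style choice: keep the running sum's norm small.
        s = 1 if np.linalg.norm(run + d) <= np.linalg.norm(run - d) else -1
        run += s * d
        first, second = (k, k + 1) if s == 1 else (k + 1, k)
        front.append(first)   # visited early next epoch
        back.append(second)   # visited late next epoch
    return front + back[::-1]

# Tiny usage example with 1-D "gradients".
grads = [np.array([1.0]), np.array([-1.0]),
         np.array([2.0]), np.array([-2.0])]
order = pair_balance_order(grads)  # a permutation of [0, 1, 2, 3]
```

Because the balancing acts on pairwise differences rather than mean-centered gradients, each pair is centered "for free", which is the intuition behind dropping the stale mean.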