Gradient Balancing (GraB) is a recently proposed technique that finds provably better data permutations when training models over a finite dataset for multiple epochs. It converges at a faster rate than the widely adopted Random Reshuffling by minimizing the discrepancy of the gradients on adjacently selected examples. However, GraB only operates under restrictive assumptions such as small batch sizes and centralized data, leaving open the question of how to order examples at large scale -- i.e., in distributed learning with decentralized data. To alleviate this limitation, in this paper we propose D-GraB, an algorithm that orders the examples in a parallel setting with negligible overhead and enjoys a linear speedup, converging at rate $\tilde{O}((mnT)^{-2/3})$ on smooth non-convex objectives and $\tilde{O}((mnT)^{-2})$ under the PL condition, where $n$ denotes the number of parallel workers, $m$ denotes the number of examples per worker, and $T$ denotes the number of epochs. D-GraB thus benefits from both data ordering and parallelism. Empirically, we show on various applications including GLUE, CIFAR10 and WikiText-2 that D-GraB outperforms naive parallel GraB and Distributed Random Reshuffling in terms of both training and validation performance.
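To make the ordering idea concrete, the following is a minimal single-worker sketch of the herding-style balancing step that underlies GraB: per-example gradients observed in one epoch are greedily assigned signs so that the running signed sum stays small, and the signs determine the permutation used in the next epoch. The function name `reorder_grab` and the greedy sign rule are illustrative assumptions; the actual D-GraB algorithm coordinates this reordering across $n$ workers with negligible overhead, which this sketch does not capture.

```python
import numpy as np

def reorder_grab(grads: np.ndarray) -> list[int]:
    """Herding-style balancing sketch (single worker, illustrative only).

    grads: array of shape (num_examples, dim) with per-example gradients
           recorded in the order the examples were visited last epoch.
    Returns a permutation of example indices to use for the next epoch.
    """
    centered = grads - grads.mean(axis=0)   # balance around the mean gradient
    running = np.zeros(grads.shape[1])      # running signed sum of centered gradients
    front, back = [], []
    for idx, g in enumerate(centered):
        # Greedy sign choice: pick the sign that keeps the running sum small,
        # so adjacent prefixes of the new order have low gradient discrepancy.
        if np.linalg.norm(running + g) <= np.linalg.norm(running - g):
            running += g
            front.append(idx)               # +1 examples are placed at the front
        else:
            running -= g
            back.append(idx)                # -1 examples go to the back, reversed
    return front + back[::-1]
```

A usage pattern consistent with the abstract: run one epoch while caching (stale) per-example gradients, call a reordering routine like the sketch above to produce the next permutation, and repeat for $T$ epochs; the distributed variant applies the same balancing idea to gradients held across decentralized workers.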