Second-order optimization methods, notably the D-KFAC (Distributed Kronecker-Factored Approximate Curvature) algorithms, have gained traction for accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC algorithms must compute and communicate a large volume of second-order information, i.e., Kronecker factors (KFs), before preconditioning gradients, resulting in large computation and communication overheads as well as a high memory footprint. In this paper, we propose DP-KFAC, a novel distributed preconditioning scheme that distributes the construction of KFs for different DNN layers across different workers. DP-KFAC not only retains the convergence property of existing D-KFAC algorithms but also brings three benefits: reduced computation overhead in constructing KFs, no communication of KFs, and a low memory footprint. Extensive experiments on a 64-GPU cluster show that DP-KFAC reduces the computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update compared to state-of-the-art D-KFAC methods.
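To illustrate the distributed-preconditioning idea stated in the abstract, the following minimal NumPy sketch (not the authors' implementation) assigns each layer to exactly one worker, which alone constructs that layer's KFs and preconditions the layer's gradient locally; only preconditioned gradients would then be exchanged, so KFs are never communicated. The layer shapes, round-robin assignment policy, and damping value are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of distributed preconditioning: each layer's Kronecker factors
# (KFs) are built only on the worker that owns that layer. Shapes, the
# round-robin ownership policy, and the damping constant are assumed values.
import numpy as np

rng = np.random.default_rng(0)
num_workers = 4
damping = 0.03  # Tikhonov damping added to each KF (assumed value)

# Hypothetical fully connected layers: (input_dim, output_dim)
layer_shapes = [(784, 512), (512, 256), (256, 128), (128, 10)]

# Round-robin assignment of layers to workers (one possible ownership policy)
owner = {layer_id: layer_id % num_workers for layer_id in range(len(layer_shapes))}

def precondition(grad, acts, grad_out):
    """Construct the layer's KFs locally and return the preconditioned gradient.

    A = E[a a^T] from input activations, G = E[g g^T] from output gradients;
    the K-FAC preconditioned gradient is (A + damping*I)^-1 @ grad @ (G + damping*I)^-1.
    """
    A = acts.T @ acts / acts.shape[0]
    G = grad_out.T @ grad_out / grad_out.shape[0]
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    return A_inv @ grad @ G_inv

batch = 32
for layer_id, (d_in, d_out) in enumerate(layer_shapes):
    worker = owner[layer_id]
    # Only the owning worker builds this layer's KFs; every other worker skips
    # it entirely, which is where the computation and memory savings come from.
    grad = rng.standard_normal((d_in, d_out))       # stand-in for the layer gradient
    acts = rng.standard_normal((batch, d_in))       # stand-in for input activations
    grad_out = rng.standard_normal((batch, d_out))  # stand-in for output gradients
    precond_grad = precondition(grad, acts, grad_out)
    print(f"layer {layer_id}: KFs built on worker {worker}, "
          f"preconditioned gradient shape {precond_grad.shape}")
```

In a real cluster run, the owning worker would broadcast only the preconditioned gradient for its layers (a communication D-KFAC already performs), which is consistent with the abstract's claim that KFs themselves need not be communicated.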