In this paper, we consider hybrid parallelism -- a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP) -- to scale distributed training of large recommendation models. We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization. The threshold is updated only once every few thousand iterations to reduce the computational overhead of compression. For communication efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during the forward and backward propagation, respectively. This is done by identifying and updating only the most relevant neurons of the neural network for each training sample in the data. We evaluate DCT on publicly available natural language processing and recommender models and datasets, as well as recommendation systems used in production at Facebook. DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively. The algorithm has been deployed in production, and it improves end-to-end training time for a state-of-the-art industrial recommender model by 37\%, without any loss in performance.
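Below is a minimal sketch of the hard-thresholding idea described above, written in PyTorch. The function and parameter names (`hard_threshold_compress`, `update_threshold`, `keep_fraction`) are illustrative assumptions, not the paper's implementation; the sketch only shows how a magnitude threshold can sparsify a gradient before communication, and how that threshold could be refreshed infrequently to amortize its cost.

```python
import torch


def update_threshold(grad: torch.Tensor, keep_fraction: float = 0.01) -> float:
    """Pick a threshold tau so roughly `keep_fraction` of entries survive.

    Hypothetical helper: in DCT the threshold is recomputed only once every
    few thousand iterations, so the cost of this top-k scan is amortized.
    """
    k = max(1, int(keep_fraction * grad.numel()))
    return grad.abs().flatten().topk(k).values.min().item()


def hard_threshold_compress(grad: torch.Tensor, tau: float):
    """Keep only entries whose magnitude reaches the threshold tau.

    Returns the surviving values and their indices -- the sparse payload
    that would actually be sent over the network.
    """
    mask = grad.abs() >= tau
    values = grad[mask]
    indices = mask.nonzero(as_tuple=False)
    return values, indices


# Usage sketch: refresh tau occasionally, compress every iteration.
grad = torch.randn(1_000_000)
tau = update_threshold(grad, keep_fraction=0.01)
values, indices = hard_threshold_compress(grad, tau)
```

The same thresholding pattern applies, per the abstract, to activations and activation gradients in the model-parallel case, where the surviving entries correspond to the most relevant neurons for a given training sample.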