Large-scale Transformer models are known for their exceptional performance across a range of tasks, but training them is difficult because they require communication-intensive model parallelism. One way to speed up training is to compress the messages exchanged during communication. Previous approaches have primarily focused on compressing gradients in a data-parallel setting, but compression in a model-parallel setting remains understudied. We find that model parallelism has fundamentally different characteristics from data parallelism. In this work, we present the first empirical study of the effectiveness of compression methods for model parallelism. We implement and evaluate three common classes of compression algorithms (pruning-based, learning-based, and quantization-based) in a popular Transformer training framework. We evaluate these methods across more than 160 settings and 8 popular datasets, taking into account different hyperparameters, hardware, and both the fine-tuning and pre-training stages. We also analyze how these methods behave as the model is scaled up. Finally, we provide insights to guide the future development of compression algorithms for model parallelism.
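To make the setting concrete, the sketch below illustrates one of the three classes mentioned above, quantization-based compression, applied to an activation tensor of the kind that crosses a model-parallel boundary between workers. This is a minimal illustration only, assuming symmetric per-tensor int8 quantization; the function names are hypothetical and do not reflect the paper's actual implementation or training framework.

```python
import torch


def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization of an activation tensor (illustrative)."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate fp32 tensor from the compressed int8 payload."""
    return q.to(torch.float32) * scale


# Toy demonstration: the int8 message is 4x smaller than the fp32 activations,
# at the cost of a bounded reconstruction error.
activations = torch.randn(8, 1024)       # activations sent across a model-parallel boundary
q, scale = quantize_int8(activations)     # compress before communication
recovered = dequantize_int8(q, scale)     # decompress on the receiving worker

fp32_bytes = activations.element_size() * activations.numel()
int8_bytes = q.element_size() * q.numel()
print(f"message size: {fp32_bytes} bytes -> {int8_bytes} bytes")
print(f"max reconstruction error: {(recovered - activations).abs().max():.4f}")
```

Pruning-based and learning-based methods follow the same pattern (compress before the communication call, decompress after) but replace the quantizer with a sparsification step or a learned encoder-decoder pair.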