Within the Transformer, self-attention is the key module for learning powerful context-aware representations. However, self-attention suffers from quadratic memory requirements with respect to the sequence length, which limits the sequence lengths we can process on a GPU. In this work, we propose sequence parallelism, a memory-efficient parallelism method that breaks the input sequence length limitation and allows training with longer sequences on GPUs. Compared with existing parallelism approaches, our approach no longer requires a single device to hold the whole sequence. Specifically, we split the input sequence into multiple chunks and feed each chunk into its corresponding device (i.e., GPU). To compute the attention output, we communicate attention embeddings among GPUs. Inspired by ring all-reduce, we integrate ring-style communication with the self-attention calculation and propose Ring Self-Attention (RSA). Our implementation is fully based on PyTorch. Without extra compiler or library changes, our approach is compatible with data parallelism and pipeline parallelism. Experiments show that sequence parallelism performs well when scaling with batch size and sequence length. Compared with tensor parallelism, our approach achieves a $13.7\times$ larger maximum batch size and a $3.0\times$ longer maximum sequence length when scaling up to 64 NVIDIA P100 GPUs. In future work, we plan to integrate sequence parallelism with data, pipeline, and tensor parallelism to train large-scale models with 4D parallelism.
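To illustrate the ring-style communication idea behind RSA, the following is a minimal single-process PyTorch sketch that simulates the devices with Python lists: key chunks are "passed around the ring" so each virtual device accumulates its full attention-score rows, then value chunks are circulated to form the local output. The function name `ring_self_attention_sim` and the chunk handling here are hypothetical, chosen only to make the pattern concrete; they are not the paper's distributed implementation, which communicates across real GPUs.

```python
import torch


def ring_self_attention_sim(q_chunks, k_chunks, v_chunks):
    """Single-process simulation of ring self-attention over P virtual devices.

    Each 'device' p holds one (sub_len, d) chunk of Q, K, V. Key chunks are
    circulated around a logical ring so every device builds its full score
    rows; value chunks are then circulated to accumulate the local output.
    """
    P = len(q_chunks)
    scale = q_chunks[0].shape[-1] ** -0.5
    sub = k_chunks[0].shape[0]  # assume equal-length chunks

    outputs = []
    for p in range(P):
        # Stage 1: at ring step `step`, device p receives the key chunk
        # that originally lived on device (p - step) % P.
        blocks = [None] * P
        for step in range(P):
            src = (p - step) % P
            blocks[src] = (q_chunks[p] @ k_chunks[src].T) * scale
        attn = torch.softmax(torch.cat(blocks, dim=-1), dim=-1)

        # Stage 2: circulate value chunks and accumulate the local output.
        out = torch.zeros_like(q_chunks[p])
        for step in range(P):
            src = (p - step) % P
            out = out + attn[:, src * sub:(src + 1) * sub] @ v_chunks[src]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    torch.manual_seed(0)
    P, sub, d = 4, 8, 16
    q, k, v = (torch.randn(P * sub, d) for _ in range(3))
    ref = torch.softmax((q @ k.T) * d ** -0.5, dim=-1) @ v
    out = torch.cat(ring_self_attention_sim(
        list(q.chunk(P)), list(k.chunk(P)), list(v.chunk(P))), dim=0)
    print(torch.allclose(out, ref, atol=1e-5))  # expected: True
```

Concatenating the per-device outputs reproduces standard full-sequence attention, which the check at the end verifies; in the distributed setting, each device keeps only its own sequence chunk and the circulating key/value chunks, which is what removes the need for any single GPU to hold the whole sequence.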