Self-attention suffers from quadratic memory requirements with respect to the sequence length. In this work, we propose sequence parallelism, a memory-efficient parallelism method that breaks the input sequence length limitation and allows efficient training with longer sequences on GPUs. Our approach is compatible with most existing parallelism techniques. More importantly, we no longer require a single device to hold the whole sequence. Specifically, we split the input sequence into multiple chunks and feed each chunk to its corresponding device (i.e., GPU). To compute the attention output, we integrate ring-style communication with the self-attention calculation and propose Ring Self-Attention (RSA). Experiments show that sequence parallelism performs well when scaling with both batch size and sequence length. Compared with tensor parallelism, our approach achieves a $13.7\times$ larger maximum batch size and a $3.0\times$ longer sequence length when scaling up to 64 NVIDIA P100 GPUs.
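As an illustration of the ring-style idea described above, the following is a minimal single-process sketch (our own interpretation under stated assumptions, not the authors' implementation): each "device" holds one chunk of Q, K, and V along the sequence dimension, and the K/V chunks are rotated around a ring so every query chunk eventually attends to the full sequence without any device materializing it all at once. The function name `ring_self_attention` and the in-memory lists standing in for devices are hypothetical.

```python
# Hypothetical single-process sketch of ring-style self-attention.
# Lists of arrays stand in for per-device chunks; real implementations
# would exchange K/V chunks between GPUs with point-to-point communication.
import numpy as np

def ring_self_attention(q_chunks, k_chunks, v_chunks):
    """q_chunks, k_chunks, v_chunks: lists of [chunk_len, d] arrays, one per device."""
    n = len(q_chunks)                      # number of devices in the ring
    d = q_chunks[0].shape[-1]
    scores = [[] for _ in range(n)]        # raw attention logits gathered per device
    values = [[] for _ in range(n)]        # matching value chunks per device
    k_ring, v_ring = list(k_chunks), list(v_chunks)
    for step in range(n):
        for dev in range(n):
            # Each device scores its local queries against the K chunk it
            # currently holds and records the matching V chunk.
            scores[dev].append(q_chunks[dev] @ k_ring[dev].T / np.sqrt(d))
            values[dev].append(v_ring[dev])
        # Pass K and V chunks one position around the ring.
        k_ring = k_ring[-1:] + k_ring[:-1]
        v_ring = v_ring[-1:] + v_ring[:-1]
    outputs = []
    for dev in range(n):
        logits = np.concatenate(scores[dev], axis=-1)            # [chunk_len, seq_len]
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)                    # softmax over the full sequence
        outputs.append(probs @ np.concatenate(values[dev], 0))   # [chunk_len, d]
    return outputs
```

In this sketch each device only ever stores its own query chunk plus one key/value chunk at a time, which is the source of the memory savings the abstract claims.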