There has been an explosion of interest in designing high-performance Transformers. While Transformers have delivered significant performance improvements, training such networks is extremely memory-intensive, owing to the need to store all intermediate activations required for gradient computation during backpropagation, especially for long sequences. To this end, we present Mesa, a memory-saving, resource-efficient training framework for Transformers. Specifically, Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training. The low-precision activations are then dequantized during backpropagation to compute gradients. Furthermore, to address the heterogeneous activation distributions in the multi-head self-attention layers, we propose a head-wise activation quantization strategy, which quantizes activations based on the statistics of each head to minimize the approximation error. To further boost training efficiency, we learn the quantization parameters with running estimates. More importantly, by re-investing the saved memory in a larger batch size or a larger model, we can further improve performance under constrained computational resources. Extensive experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training while achieving comparable or even better performance. Code is available at https://github.com/zhuang-group/Mesa.
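To make the mechanism concrete, below is a minimal PyTorch-style sketch, not the official Mesa implementation, of head-wise activation quantization wrapped in a custom autograd function: the forward pass computes the exact softmax, saves only an 8-bit per-head quantized copy of its output, and dequantizes that copy in the backward pass to form the gradient. The names `MemorySavingSoftmax`, `headwise_quantize` and `headwise_dequantize` are illustrative assumptions, and the per-batch head-wise min/max used here stands in for the running estimates learned in the paper.

```python
import torch


def headwise_quantize(x, num_bits=8):
    """Quantize a (batch, heads, tokens, tokens) activation to num_bits integers,
    using separate min/max statistics for each attention head (dim 1)."""
    qmax = 2 ** num_bits - 1
    x_min = x.amin(dim=(0, 2, 3), keepdim=True)        # per-head minimum
    x_max = x.amax(dim=(0, 2, 3), keepdim=True)        # per-head maximum
    scale = (x_max - x_min).clamp(min=1e-8) / qmax     # per-head step size
    q = ((x - x_min) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, x_min


def headwise_dequantize(q, scale, x_min):
    """Reconstruct an approximate float activation from its quantized copy."""
    return q.to(scale.dtype) * scale + x_min


class MemorySavingSoftmax(torch.autograd.Function):
    """Softmax that keeps only a quantized copy of its output for backward.
    The forward result is exact; only the saved activation is low precision."""

    @staticmethod
    def forward(ctx, attn_logits):
        out = torch.softmax(attn_logits, dim=-1)       # exact forward pass
        # Per-batch head-wise min/max stands in for the paper's running estimates.
        q, scale, x_min = headwise_quantize(out)
        ctx.save_for_backward(q, scale, x_min)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        q, scale, x_min = ctx.saved_tensors
        out = headwise_dequantize(q, scale, x_min)     # approximate activation
        # Gradient of softmax: y * (dL/dy - sum_j dL/dy_j * y_j)
        return out * (grad_out - (grad_out * out).sum(dim=-1, keepdim=True))


# Example: attention logits of shape (batch, num_heads, tokens, tokens)
attn_logits = torch.randn(2, 4, 16, 16, requires_grad=True)
attn = MemorySavingSoftmax.apply(attn_logits)
attn.sum().backward()
```

The same pattern can in principle be applied to other memory-heavy layers (GELU, LayerNorm, linear projections), with the head-wise statistics reserved for the multi-head self-attention activations whose distributions differ across heads.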