The increasing size of neural network models has been critical for improvements in their accuracy, but device memory is not growing at the same rate. This creates fundamental challenges for training neural networks within limited memory environments. In this work, we propose ActNN, a memory-efficient training framework that stores randomly quantized activations for backpropagation. We prove the convergence of ActNN for general network architectures, and we characterize the impact of quantization on the convergence via an exact expression for the gradient variance. Using our theory, we propose novel mixed-precision quantization strategies that exploit the activation's heterogeneity across feature dimensions, samples, and layers. These techniques can be readily applied to existing dynamic graph frameworks, such as PyTorch, simply by substituting the layers. We evaluate ActNN on mainstream computer vision models for classification, detection, and segmentation tasks. On all these tasks, ActNN compresses the activation to 2 bits on average, with negligible accuracy loss. ActNN reduces the memory footprint of the activation by 12x, and it enables training with a 6.6x to 14x larger batch size.
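To illustrate the core idea of storing randomly quantized activations for the backward pass and applying it via layer substitution in PyTorch, the following is a minimal sketch, not the authors' implementation. The names `stochastic_quantize`, `QuantizedLinearFunction`, and `QLinear` are illustrative, and the sketch uses a simple per-tensor 2-bit quantizer rather than the paper's mixed-precision strategies.

```python
# Sketch only: store a stochastically quantized copy of the activation for
# the backward pass instead of the full-precision tensor.
import torch
import torch.nn as nn


def stochastic_quantize(x, bits=2):
    """Per-tensor stochastic quantization to `bits` bits.

    Returns integer codes plus the (min, scale) needed to dequantize.
    Stochastic rounding keeps the quantizer unbiased, so the resulting
    weight gradient is an unbiased estimate of the true gradient.
    """
    levels = 2 ** bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / levels
    normalized = (x - x_min) / scale
    # Randomized rounding: round up with probability equal to the fraction.
    codes = torch.floor(normalized + torch.rand_like(normalized))
    codes = codes.clamp(0, levels).to(torch.uint8)
    return codes, x_min, scale


def dequantize(codes, x_min, scale):
    return codes.float() * scale + x_min


class QuantizedLinearFunction(torch.autograd.Function):
    """Linear op that saves only a 2-bit activation for backward."""

    @staticmethod
    def forward(ctx, x, weight):
        out = x.matmul(weight.t())
        codes, x_min, scale = stochastic_quantize(x, bits=2)
        ctx.save_for_backward(codes, x_min, scale, weight)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        codes, x_min, scale, weight = ctx.saved_tensors
        x_hat = dequantize(codes, x_min, scale)  # approximate activation
        grad_x = grad_out.matmul(weight)
        grad_w = grad_out.t().matmul(x_hat)
        return grad_x, grad_w


class QLinear(nn.Linear):
    """Drop-in substitute for nn.Linear using the quantized autograd op."""

    def forward(self, x):
        out = QuantizedLinearFunction.apply(x, self.weight)
        if self.bias is not None:
            out = out + self.bias
        return out


if __name__ == "__main__":
    layer = QLinear(16, 8)
    x = torch.randn(4, 16, requires_grad=True)
    layer(x).sum().backward()
    print(layer.weight.grad.shape)  # torch.Size([8, 16])
```

In this sketch, memory savings come from keeping only the `uint8` codes (and two scalars) between the forward and backward passes; "substituting the layers" then amounts to swapping `nn.Linear` for `QLinear` (and analogously for convolution and normalization layers) in an existing model definition.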