Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model contains quite a few pointwise nonlinearities, and each such operation induces additional memory costs which -- as we show -- can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute an optimal quantization of the retained gradients of pointwise nonlinear functions using only a few bits per element. We show that such a quantization can be obtained by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. Drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and unchanged convergence on several open benchmarks.
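To make the idea concrete, below is a minimal PyTorch sketch (not the authors' released implementation) of a few-bit drop-in GELU: the forward pass stores only a small integer code per element, obtained from a piecewise-constant approximation of the activation's derivative, and the backward pass multiplies the incoming gradient by the dequantized derivative. The interval boundaries and level values here are illustrative placeholders, not the optimal ones found by dynamic programming, and the codes are kept in a uint8 tensor rather than being bit-packed.

```python
import torch

# Hypothetical 2-bit piecewise-constant approximation of dGELU/dx:
# `BOUNDARIES` are the interval edges, `LEVELS` the constant derivative
# value used inside each interval (4 levels -> 2 bits per element).
BOUNDARIES = torch.tensor([-1.0, 0.0, 1.0])       # 3 edges -> 4 buckets
LEVELS = torch.tensor([-0.05, 0.15, 0.85, 1.05])  # assumed constants

class FewBitGELU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Save only the bucket index of each element instead of x itself.
        codes = torch.bucketize(x, BOUNDARIES.to(x.device)).to(torch.uint8)
        ctx.save_for_backward(codes)
        return torch.nn.functional.gelu(x)

    @staticmethod
    def backward(ctx, grad_output):
        (codes,) = ctx.saved_tensors
        # Dequantize: look up the constant derivative for each bucket.
        approx_deriv = LEVELS.to(grad_output.device)[codes.long()]
        return grad_output * approx_deriv

if __name__ == "__main__":
    x = torch.randn(8, requires_grad=True)
    FewBitGELU.apply(x).sum().backward()
    print(x.grad)  # gradients built from the few-bit approximation
```

In this sketch the memory saved is the difference between storing the full-precision input and storing a few-bit code per element; a production version would additionally pack the codes into bit fields to realize the full reduction.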