A standard hardware bottleneck when training deep neural networks is GPU memory. The bulk of memory is occupied by caching intermediate tensors for gradient computation in the backward pass. We propose a novel method to reduce this footprint: Dropping Intermediate Tensors (DropIT). DropIT drops the min-k elements of the intermediate tensors and approximates gradients from the sparsified tensors in the backward pass. Theoretically, DropIT reduces noise on the estimated gradients and therefore has a higher rate of convergence than vanilla SGD. Experiments show that we can drop up to 90\% of the intermediate tensor elements in fully-connected and convolutional layers while achieving higher testing accuracy for Visual Transformers and Convolutional Neural Networks on various tasks (e.g., classification, object detection, instance segmentation). Our code and models are available at https://github.com/chenjoya/dropit.
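To make the idea concrete, below is a minimal sketch of the DropIT principle for a fully-connected layer, written as a PyTorch-style autograd function. The names (`DropITLinearFn`, `keep_ratio`) are illustrative assumptions, not the authors' API, and a real implementation would store the surviving elements in a compact sparse form to actually save memory rather than merely zeroing them out.

```python
# Hypothetical sketch of DropIT for a linear layer (not the authors' implementation).
# Forward pass is exact; only the cached activation used for grad_weight is sparsified.
import torch
from torch.autograd import Function


class DropITLinearFn(Function):
    @staticmethod
    def forward(ctx, x, weight, keep_ratio=0.1):
        # Exact forward pass: y = x @ W^T
        y = x.matmul(weight.t())

        # Keep only the top-k largest-magnitude elements of the cached input
        # (i.e., drop the min-k elements); dropped positions are zeroed here,
        # but would be stored sparsely in practice to reduce memory.
        k = max(1, int(keep_ratio * x.numel()))
        threshold = torch.topk(x.abs().flatten(), k, largest=True).values.min()
        x_sparse = torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

        ctx.save_for_backward(x_sparse, weight)
        return y

    @staticmethod
    def backward(ctx, grad_y):
        x_sparse, weight = ctx.saved_tensors
        grad_x = grad_y.matmul(weight)        # exact: does not depend on the cached input
        grad_w = grad_y.t().matmul(x_sparse)  # approximated from the sparsified input
        return grad_x, grad_w, None


# Example usage (shapes are arbitrary):
# x = torch.randn(32, 512, requires_grad=True)
# w = torch.randn(256, 512, requires_grad=True)
# y = DropITLinearFn.apply(x, w, 0.1)
# y.sum().backward()
```

Note that for a linear layer the gradient with respect to the input does not require the cached activation at all; only the weight gradient is approximated, which is why dropping most of the cached elements can leave accuracy largely intact.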