When deploying deep learning models to a device, it is traditionally assumed that the available computational resources (compute, memory, and power) remain static. However, real-world computing systems do not always provide stable resource guarantees: computational resources need to be conserved when load from other processes is high or battery power is low. Inspired by recent work on neural network subspaces, we propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models ranging from highly efficient to highly accurate. Our models require no retraining, so the entire subspace can be deployed on-device to allow adaptive network compression at inference time. We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity. We achieve accuracies on par with standard models when testing our uncompressed models, and maintain high accuracy for sparsity rates above 90% when testing our compressed models. We also demonstrate that our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
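To make the idea of selecting a model from a compressible subspace at inference time more concrete, the following is a minimal sketch and not the method as implemented in the paper. The abstract does not specify the shape of the subspace or the pruning criterion, so everything here is an assumption made for illustration: the subspace is taken to be a line between two endpoint weight vectors, compression is unstructured magnitude pruning, and the sparsity rate is a linear function of the position `alpha` on the line. The names `interpolate`, `magnitude_prune`, `model_at`, and `max_sparsity` are hypothetical.

```python
import random

def interpolate(w_a, w_b, alpha):
    """Point on the line between two endpoint weight vectors (assumed subspace shape)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(w_a, w_b)]

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_zero = int(round(sparsity * len(weights)))
    # Indices sorted by ascending absolute value; the first n_zero are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_zero])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def model_at(w_accurate, w_efficient, alpha, max_sparsity=0.9):
    """Pick one model from the subspace without any retraining.

    alpha = 1.0 gives the dense "accurate" endpoint; alpha = 0.0 gives the
    sparsest "efficient" endpoint. The linear alpha-to-sparsity mapping is an
    assumption of this sketch.
    """
    weights = interpolate(w_accurate, w_efficient, alpha)
    sparsity = (1.0 - alpha) * max_sparsity
    return magnitude_prune(weights, sparsity)

# Stand-in endpoint weights; in practice these would come from training.
rng = random.Random(0)
w_accurate = [rng.gauss(0.0, 1.0) for _ in range(100)]
w_efficient = [rng.gauss(0.0, 1.0) for _ in range(100)]

for alpha in (1.0, 0.5, 0.0):
    w = model_at(w_accurate, w_efficient, alpha)
    zeros = sum(1 for x in w if x == 0.0)
    print(f"alpha={alpha:.1f} zeros={zeros}/100")
```

In a deployment along these lines, a device would store only the two endpoints and choose `alpha` from its current load or battery level, which is what allows the trade-off to be adjusted in arbitrarily small steps.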