In this paper, we introduce a novel method for neural network weight compression. In our method, we store weight tensors as sparse, quantized matrix factors whose product is computed on the fly during inference to generate the target model's weights. We use projected gradient descent to find quantized, sparse factorizations of the weight tensors. We show that this approach can be seen as a unification of weight SVD, vector quantization, and sparse PCA. Combined with end-to-end fine-tuning, our method exceeds or matches previous state-of-the-art methods in the trade-off between accuracy and model size. Unlike vector quantization, our method is applicable to both moderate and extreme compression regimes.
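The abstract does not spell out the optimization procedure, so the following is only a minimal NumPy sketch of the general idea: projected gradient descent on a two-factor decomposition W ≈ UV, alternating a gradient step with a hard-thresholding projection (sparsity) on U and a uniform-grid rounding projection (quantization) on V. The function names (`factorize`, `project_sparse`, `project_quantized`) and all hyperparameters (`rank`, `keep_frac`, `n_levels`, `lr`) are hypothetical choices for illustration, not values or projections from the paper.

```python
import numpy as np

def project_sparse(M, keep_frac):
    # Hard-thresholding projection: keep only the largest-magnitude entries.
    flat = np.abs(M).ravel()
    k = max(1, int(keep_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]
    return np.where(np.abs(M) >= thresh, M, 0.0)

def project_quantized(M, n_levels):
    # Quantization projection: round entries onto a uniform grid spanning
    # the current range of M (a simplification; the paper may use a codebook).
    lo, hi = M.min(), M.max()
    step = (hi - lo) / max(n_levels - 1, 1)
    if step == 0:
        return M
    return lo + np.round((M - lo) / step) * step

def factorize(W, rank, steps=500, lr=1e-2, keep_frac=0.1, n_levels=16, seed=0):
    # Find W ~ U @ V with U sparse and V quantized via projected gradient
    # descent on the squared reconstruction error.
    rng = np.random.default_rng(seed)
    m, n = W.shape
    U = rng.standard_normal((m, rank)) * 0.1
    V = rng.standard_normal((rank, n)) * 0.1
    for _ in range(steps):
        R = U @ V - W                  # reconstruction residual
        gU, gV = R @ V.T, U.T @ R      # gradients of 0.5 * ||U V - W||_F^2
        U = project_sparse(U - lr * gU, keep_frac)
        V = project_quantized(V - lr * gV, n_levels)
    return U, V

W = np.random.default_rng(1).standard_normal((64, 64))
U, V = factorize(W, rank=32)
print("relative error:", np.linalg.norm(U @ V - W) / np.linalg.norm(W))
```

At inference time, only the sparse U and quantized V would be stored, and the dense weight matrix is rematerialized as their product on the fly, as the abstract describes.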