We propose DiffQ, a differentiable method for model compression that quantizes model parameters without gradient approximations (e.g., the Straight Through Estimator). We suggest adding independent pseudo quantization noise to the model parameters during training to approximate the effect of a quantization operator. DiffQ is differentiable both with respect to the unquantized weights and to the number of bits used. Given a single hyper-parameter that balances quantized model size against accuracy, DiffQ optimizes the number of bits used per individual weight, or per group of weights, in end-to-end training. We experimentally verify that our method is competitive with STE-based quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation. For instance, on the ImageNet dataset, DiffQ compresses a 12-layer transformer-based model by more than a factor of 8 (less than 4 bits of precision per weight on average), with a loss of 0.3% in model accuracy. Code is available at github.com/facebookresearch/diffq.
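To make the mechanism concrete, below is a minimal PyTorch sketch of training-time pseudo quantization noise. It assumes a uniform quantizer over the weight range and a sigmoid-parameterized bit width; the function and parameter names (pseudo_quant_noise, logit_bits) are illustrative and do not correspond to the diffq library's actual API.

```python
import torch

def pseudo_quant_noise(weight, logit_bits, min_bits=2.0, max_bits=15.0):
    # Illustrative sketch (not the official diffq API): map an unconstrained
    # learnable parameter to a bit width in [min_bits, max_bits].
    bits = min_bits + (max_bits - min_bits) * torch.sigmoid(logit_bits)
    # Step size of a uniform quantizer covering the symmetric weight range.
    scale = weight.abs().max()
    delta = 2 * scale / (2 ** bits - 1)
    # Additive uniform noise in [-delta/2, delta/2] mimics the rounding error
    # of quantization and remains differentiable w.r.t. `bits` through `delta`.
    noise = (torch.rand_like(weight) - 0.5) * delta
    return weight + noise

# Usage: the noisy weights replace the true weights in the forward pass,
# so gradients flow to both the weights and the bit-width parameter.
w = torch.randn(256, 256, requires_grad=True)
logit_bits = torch.zeros((), requires_grad=True)
w_noisy = pseudo_quant_noise(w, logit_bits)
```

At inference time the noise would be replaced by actual uniform quantization at the learned bit width; the model-size term controlled by the single hyper-parameter would be added to the training loss as a function of `bits`.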