We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. This method, DiffQ, is differentiable both with respect to the unquantized parameters and to the number of bits used. Given a single hyper-parameter expressing the desired balance between the quantized model size and accuracy, DiffQ can optimize the number of bits used per individual weight or group of weights in a single training run. We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation. For instance, on the Wikitext-103 language modeling benchmark, DiffQ compresses a 16-layer transformer model by a factor of 8, equivalent to 4 bits of precision, with a loss of 0.3\% in model accuracy. Code is available at: https://github.com/facebookresearch/diffq
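As a rough illustration of the pseudo quantization noise idea, the following is a minimal PyTorch sketch, not the DiffQ library's actual API: it adds uniform noise whose scale matches the step size of a uniform quantizer with a learnable bit-width, and trades task loss against a crude model-size proxy. The function name, the size proxy, and the trade-off weight lambda_size are all hypothetical.

\begin{verbatim}
import torch

def add_pseudo_quant_noise(w, bits):
    """Add uniform noise whose scale matches a `bits`-bit uniform quantizer
    over the range of `w` (illustrative sketch only, not the DiffQ code)."""
    b = bits.clamp(2.0, 15.0)                       # keep the bit-width in a sane range
    delta = (w.max() - w.min()) / (2.0 ** b - 1.0)  # step size of the quantizer
    noise = (torch.rand_like(w) - 0.5) * delta      # same magnitude as quantization error
    return w + noise                                # differentiable in both w and bits

# Toy usage: learn a weight matrix and its bit-width jointly.
torch.manual_seed(0)
weight = torch.nn.Parameter(torch.randn(16, 16))
bits = torch.nn.Parameter(torch.tensor(8.0))        # learnable bit-width for this group
lambda_size = 1e-4                                   # hypothetical accuracy/size trade-off
opt = torch.optim.Adam([weight, bits], lr=1e-3)
x, y = torch.randn(32, 16), torch.randn(32, 16)
for _ in range(100):
    w_noisy = add_pseudo_quant_noise(weight, bits)
    task_loss = torch.nn.functional.mse_loss(x @ w_noisy.t(), y)
    size_penalty = lambda_size * bits * weight.numel()  # crude size proxy, in bits
    loss = task_loss + size_penalty
    opt.zero_grad(); loss.backward(); opt.step()
\end{verbatim}

In this sketch the noise is only used during training; presumably, at inference the learned bit-widths would be rounded and actual quantization applied to the weights.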