Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and lower inference latencies. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies. In this work, we study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit-precision and model size that maximize zero-shot performance. We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 66B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is challenging to improve the bit-level scaling trade-off, with the only improvements coming from using a small block size -- splitting the parameters into small, independently quantized blocks -- and from the choice of quantization data type (e.g., Int vs. Float). Overall, our findings show that 4-bit precision is almost universally optimal for the trade-off between total model bits and zero-shot accuracy.
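To make the block-size idea concrete, the sketch below shows a minimal block-wise absmax integer quantizer in NumPy: parameters are split into independent blocks, and each block is scaled by its own absolute maximum before rounding to a k-bit signed grid. The function names, the absmax rounding scheme, and the int8 storage are illustrative assumptions for exposition, not the exact methods or data types evaluated in the paper.

```python
import numpy as np

def blockwise_quantize_int(weights, bits=4, block_size=64):
    """Illustrative block-wise absmax integer quantization.

    Splits a flat weight vector into independent blocks of `block_size`
    values; each block is scaled by its own absolute maximum and rounded
    to a signed grid with 2**(bits-1) - 1 positive levels.
    """
    levels = 2 ** (bits - 1) - 1                # e.g. 7 for 4-bit signed
    flat = weights.reshape(-1)
    pad = (-len(flat)) % block_size             # pad so length divides evenly
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)

    scales = np.abs(blocks).max(axis=1, keepdims=True)  # one scale per block
    scales[scales == 0] = 1.0                            # avoid division by zero
    q = np.round(blocks / scales * levels).astype(np.int8)
    return q, scales

def blockwise_dequantize_int(q, scales, bits=4, original_size=None):
    """Recovers approximate 16/32-bit weights from the quantized blocks."""
    levels = 2 ** (bits - 1) - 1
    deq = (q.astype(np.float32) / levels) * scales
    deq = deq.reshape(-1)
    return deq[:original_size] if original_size is not None else deq

# Example: smaller blocks track local outliers more closely, at the cost of
# storing one scale per block (the overhead that grows as block size shrinks).
w = np.random.randn(4096).astype(np.float32)
q, s = blockwise_quantize_int(w, bits=4, block_size=64)
w_hat = blockwise_dequantize_int(q, s, bits=4, original_size=w.size)
print("mean abs error:", np.mean(np.abs(w - w_hat)))
```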