Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies. In this work, we study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit-precision and model size that maximizes zero-shot performance. We run more than 35,000 experiments with 16-bit inputs and k-bit parameters to examine which zero-shot quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 176B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is challenging to improve the bit-level scaling trade-off, with the only improvements being the use of a small block size -- splitting the parameters into small independently quantized blocks -- and the quantization data type being used (e.g., Int vs Float). Overall, our findings show that 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy.
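To make the block-size idea concrete, the following is a minimal NumPy sketch of block-wise absmax quantization: parameters are split into small blocks, each quantized independently with its own scale. The function names, the 4-bit signed-integer format, and the block size of 64 are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def quantize_blockwise(params, bits=4, block_size=64):
    """Quantize a parameter tensor to signed k-bit integers, one absmax scale per block."""
    flat = params.astype(np.float32).ravel()
    pad = (-len(flat)) % block_size              # pad so the length is a multiple of block_size
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    qmax = 2 ** (bits - 1) - 1                   # e.g., 7 for 4-bit signed integers
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                    # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales.squeeze(1), params.shape, pad

def dequantize_blockwise(q, scales, shape, pad):
    """Reconstruct a float32 approximation of the original parameters."""
    flat = (q.astype(np.float32) * scales[:, None]).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

if __name__ == "__main__":
    w = np.random.randn(1000).astype(np.float32)
    q, s, shape, pad = quantize_blockwise(w, bits=4, block_size=64)
    w_hat = dequantize_blockwise(q, s, shape, pad)
    print("mean abs quantization error:", np.abs(w - w_hat).mean())

A smaller block size stores more scale values (higher memory overhead per parameter) but limits the damage any single outlier can do, since it only affects the scale of its own block.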