Recent advances in self-supervised learning, combined with the Transformer architecture, have enabled natural language processing (NLP) models to achieve remarkably low perplexity. Such powerful models, however, come with ever-increasing model sizes and therefore demand large amounts of computation and memory. In this paper, we propose an efficient inference framework for large-scale generative language models. As the key to reducing model size, we quantize weights with a non-uniform quantization method. Quantized matrix multiplications are then accelerated by our proposed kernel, nuQmm, which allows a wide trade-off between compression ratio and accuracy. nuQmm reduces not only the latency of each GPU but also the end-to-end inference latency of large LMs, because a high compression ratio (achieved by low-bit quantization) reduces the minimum number of GPUs required. We demonstrate that nuQmm can accelerate the inference speed of the GPT-3 (175B) model by about 14.4 times and save energy consumption by 93%.
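As a concrete illustration of what non-uniform weight quantization can look like, the sketch below implements a greedy binary-coding quantization of a weight vector in NumPy. The function names (greedy_bcq, dequantize) and the choice of scheme are illustrative assumptions; the abstract does not specify the exact non-uniform method used by nuQmm.

```python
import numpy as np

def greedy_bcq(w, num_bits):
    """Greedy binary-coding quantization of a weight vector.

    Approximates w ~= sum_i alpha_i * b_i with b_i in {-1, +1}^n,
    which yields non-uniform quantization levels. This is a generic
    sketch, not necessarily the scheme used by nuQmm.
    """
    residual = w.astype(np.float64).copy()
    alphas, codes = [], []
    for _ in range(num_bits):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign pattern of the residual
        alpha = np.abs(residual).mean()          # least-squares scale for this bit
        alphas.append(alpha)
        codes.append(b)
        residual -= alpha * b                    # remove the captured component
    return np.array(alphas), np.array(codes)

def dequantize(alphas, codes):
    """Reconstruct the approximate weights from scales and binary codes."""
    return (alphas[:, None] * codes).sum(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(1024).astype(np.float32)
    alphas, codes = greedy_bcq(w, num_bits=3)    # 3-bit non-uniform quantization
    w_hat = dequantize(alphas, codes)
    err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
    print(f"relative quantization error with 3 bits: {err:.3f}")
```

With such a representation, a matrix-vector product can be evaluated from the binary codes and per-bit scales rather than full-precision weights, which is the kind of computation a specialized kernel like nuQmm can exploit on GPUs.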