Recent advances in self-supervised learning, combined with the Transformer architecture, have enabled natural language processing (NLP) models to achieve remarkably low perplexity. However, more powerful NLP models require ever larger model sizes, leading to substantial computational and memory requirements. In this paper, we introduce an efficient inference framework tailored for large-scale generative language models. To reduce the model size, we employ a weight-only quantization strategy while preserving full precision for activations. As a result, we attain sub-4-bit quantization for each weight through either non-uniform or uniform quantization techniques. Our proposed kernel, called LUT-GEMM, then accelerates quantized matrix multiplications, offering a flexible balance between compression ratio and accuracy. Unlike earlier matrix multiplication kernels that accommodate weight-only quantization, LUT-GEMM eliminates the resource-demanding dequantization step for both uniform and non-uniform quantization methods. High compression ratios from low-bit quantization, together with efficient lookup-table (LUT) based operations, decrease the number of GPUs required. By reducing both the latency on each GPU and the latency of the overall inference process, LUT-GEMM provides significant performance improvements for large-scale language models. For the OPT-175B model with 3-bit quantization, we show that LUT-GEMM reduces the per-token generation latency by a factor of 2.1 compared to OPTQ, which requires costly dequantization. Consequently, LUT-GEMM enables inference of the OPT-175B model on a single GPU without noticeable degradation in accuracy or performance, while the non-quantized OPT-175B model requires a minimum of 8 GPUs.
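To illustrate how a lookup table can stand in for dequantization, the following is a minimal NumPy sketch rather than the actual GPU kernel. It assumes a storage format the abstract does not specify: each weight row is a sum of a few ±1 bit planes with one scale per row and plane. The names `build_luts`, `pack_keys`, `lut_matvec`, and the chunk length `mu` are hypothetical and chosen only for this example.

```python
import numpy as np

def build_luts(x, mu):
    """For each length-mu chunk of x, tabulate its dot product with every +/-1 pattern."""
    n = x.shape[0]
    assert n % mu == 0, "vector length must be a multiple of mu"
    chunks = x.reshape(n // mu, mu)                          # (n/mu, mu)
    keys = np.arange(2 ** mu)
    # patterns[k, j] is +1 if bit j of key k is set, otherwise -1
    bit_set = ((keys[:, None] >> np.arange(mu)) & 1) == 1    # (2^mu, mu)
    patterns = np.where(bit_set, 1.0, -1.0)
    return chunks @ patterns.T                               # (n/mu, 2^mu)

def pack_keys(B, mu):
    """Pack each run of mu sign values (+1/-1) in every row of B into one integer key."""
    m, n = B.shape
    bits = (B.reshape(m, n // mu, mu) > 0).astype(np.int64)
    return (bits << np.arange(mu)).sum(axis=-1)              # (m, n/mu)

def lut_matvec(alphas, keys, x, mu):
    """Compute y = sum_i alphas[i] * (B_i @ x) using table lookups only.

    alphas: (q, m) per-row scales, one row of scales per bit plane (assumed format)
    keys:   (q, m, n/mu) packed sign patterns from pack_keys
    """
    luts = build_luts(x, mu)            # built once, shared by all rows and bit planes
    chunk_idx = np.arange(luts.shape[0])
    y = np.zeros(keys.shape[1])
    for alpha, k in zip(alphas, keys):
        # luts[chunk_idx, k][r, c] is the partial dot product of row r with chunk c
        y += alpha * luts[chunk_idx, k].sum(axis=1)
    return y
```

In this sketch the full-precision weight matrix is never materialized: each group of `mu` weights costs one table read instead of `mu` multiply-adds, and the table depends only on the activation vector, so it is reused across every output row.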