The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck, as model parameter counts scale far faster than hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods rely predominantly on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths at each token-generation step. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision-requirement model for optimal switching, and implements efficient fine-grained precision management. Evaluations demonstrate that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss. This framework offers a flexible, adaptive solution for efficient LLM deployment. Code is released at https://github.com/ZongwuWang/FlexQuant.git.
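The abstract names token-level perplexity entropy and Kullback-Leibler divergence as the signals that drive precision switching. Below is a minimal sketch of how such per-token signals could be computed; the function name `precision_switch_signal`, the thresholds, and the comparison against a higher-precision reference distribution are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def precision_switch_signal(logits_ref, logits_q,
                            entropy_thresh=2.5, kl_thresh=0.05):
    """Compute per-token signals that could drive precision switching.

    logits_ref: logits from a higher-precision reference pass, shape (vocab,)
    logits_q:   logits from the currently active quantized model, shape (vocab,)
    The thresholds are illustrative placeholders, not values from the paper.
    """
    log_p = F.log_softmax(logits_ref, dim=-1)
    log_q = F.log_softmax(logits_q, dim=-1)

    # Predictive entropy of the reference distribution (in nats):
    # high entropy suggests model uncertainty, favoring higher precision.
    entropy = -(log_p.exp() * log_p).sum()

    # KL(p || q): how far the quantized output distribution has drifted
    # from the reference; large divergence argues for raising bit-width.
    kl = F.kl_div(log_q, log_p, log_target=True, reduction="sum")

    raise_precision = bool(entropy > entropy_thresh or kl > kl_thresh)
    return entropy.item(), kl.item(), raise_precision

# Illustrative use with random logits over a toy vocabulary:
vocab = 32000
torch.manual_seed(0)
ent, kl, up = precision_switch_signal(torch.randn(vocab), torch.randn(vocab))
```

In a generation loop, a signal like this would be evaluated once per emitted token, and the layer-wise bit-width configuration would be raised or lowered before the next decoding step.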