The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck, as model parameter counts scale far faster than hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods rely predominantly on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths at each token-generation step. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision-requirement model for optimal switching, and implements efficient fine-grained precision management. Evaluations demonstrate that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss. This framework offers a flexible, adaptive solution for efficient LLM deployment. Code is released at https://github.com/ZongwuWang/FlexQuant.git.
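The abstract names token-level perplexity entropy and Kullback-Leibler divergence as the signals that drive precision switching. Below is a minimal sketch of how such per-token signals could be computed; the function name `precision_switch_signal`, the thresholds, and the comparison against a higher-precision reference distribution are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def precision_switch_signal(logits_ref, logits_q,
                            entropy_thresh=2.5, kl_thresh=0.05):
    """Compute per-token signals that could drive precision switching.

    logits_ref: logits from a higher-precision reference pass, shape (vocab,)
    logits_q:   logits from the currently active quantized model, shape (vocab,)
    The thresholds are illustrative placeholders, not values from the paper.
    """
    log_p = F.log_softmax(logits_ref, dim=-1)
    log_q = F.log_softmax(logits_q, dim=-1)

    # Predictive entropy of the reference distribution (in nats):
    # high entropy suggests model uncertainty, favoring higher precision.
    entropy = -(log_p.exp() * log_p).sum()

    # KL(p || q): how far the quantized output distribution has drifted
    # from the reference; large divergence argues for raising bit-width.
    kl = F.kl_div(log_q, log_p, log_target=True, reduction="sum")

    raise_precision = bool(entropy > entropy_thresh or kl > kl_thresh)
    return entropy.item(), kl.item(), raise_precision

# Illustrative use with random logits over a toy vocabulary:
vocab = 32000
torch.manual_seed(0)
ent, kl, up = precision_switch_signal(torch.randn(vocab), torch.randn(vocab))
```

In a generation loop, a signal like this would be evaluated once per emitted token, and the layer-wise bit-width configuration would be raised or lowered before the next decoding step.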