The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen.
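For intuition on the 4-bit compression mentioned above, the sketch below shows a simple group-wise quantization round-trip in plain NumPy. It is not FlexGen's implementation; the group size of 64 and the asymmetric min/max scheme are assumptions used only to illustrate mapping each group of values onto 16 levels and reconstructing them with small error.

```python
import numpy as np

def quantize_4bit(x, group_size=64):
    """Group-wise asymmetric 4-bit quantization of a 1-D float array.
    Hypothetical illustration; group_size and the min/max scheme are assumptions."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % group_size
    x_padded = np.concatenate([x, np.zeros(pad, dtype=np.float32)])
    groups = x_padded.reshape(-1, group_size)
    mins = groups.min(axis=1, keepdims=True)
    maxs = groups.max(axis=1, keepdims=True)
    scales = (maxs - mins) / 15.0            # 4 bits -> 16 levels (0..15)
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round((groups - mins) / scales), 0, 15).astype(np.uint8)
    return q, mins, scales, len(x)

def dequantize_4bit(q, mins, scales, length):
    """Reconstruct the original values from the quantized groups."""
    groups = q.astype(np.float32) * scales + mins
    return groups.reshape(-1)[:length]

# Usage: round-trip a random weight vector and check the reconstruction error.
w = np.random.randn(1000).astype(np.float32)
q, mins, scales, n = quantize_4bit(w)
w_hat = dequantize_4bit(q, mins, scales, n)
print("max abs error:", np.abs(w - w_hat).max())
```

Storing the 4-bit codes plus per-group min and scale, rather than full-precision floats, is what shrinks the weights and attention cache enough to enlarge the feasible batch sizes described in the abstract.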