The breakthrough performance of large language models (LLMs) comes with large computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a novel structured compression approach for LLMs, called ZipLM, which provides state-of-the-art compression-vs-accuracy results, while guaranteeing to match a set of (achievable) target speedups on any given target hardware. Specifically, given a task, a model, an inference environment, and a set of speedup targets, ZipLM identifies and removes redundancies in the model through iterative structured shrinking of the model's weight matrices. Importantly, ZipLM works in both the post-training/one-shot and the gradual compression settings, where it produces a set of accurate models in a single run, making it highly efficient in practice. Our approach is based on novel structured pruning and knowledge distillation techniques, and consistently outperforms prior structured compression methods in terms of accuracy-versus-speedup in experiments on BERT- and GPT-family models. In particular, when compressing the GPT2 model, ZipLM outperforms DistilGPT2 while being 60% smaller and 30% faster. Further, ZipLM matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large architecture, and outperforms all prior BERT-base compression techniques such as CoFi, MiniLM, and TinyBERT.
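To make the iterative shrinking loop concrete, the following is a minimal sketch, not the authors' implementation: it assumes a toy single-weight-matrix "model", a simple column-norm saliency proxy, and a hypothetical latency_of stand-in for on-device timings (ZipLM itself uses a loss-aware pruning criterion and profiles the actual target hardware). It only illustrates the pattern of removing structural units until a target speedup is met.

import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a single FFN weight matrix whose columns (hidden units)
# are the structural units we may remove.
W = rng.normal(size=(768, 3072))

def saliency(W):
    # Per-column importance proxy: squared L2 norm of each column.
    # (An assumption for illustration; ZipLM uses a loss-aware criterion.)
    return (W ** 2).sum(axis=0)

def latency_of(num_cols):
    # Hypothetical stand-in for measured latency in the target inference
    # environment; in practice this comes from profiling real hardware.
    return 1.0 + 0.001 * num_cols

TARGET_SPEEDUP = 1.5
base_latency = latency_of(W.shape[1])

# Iteratively drop the least salient structural unit until the modeled
# speedup over the dense baseline reaches the target.
while base_latency / latency_of(W.shape[1]) < TARGET_SPEEDUP:
    drop = int(np.argmin(saliency(W)))
    W = np.delete(W, drop, axis=1)

print(f"kept {W.shape[1]} of 3072 columns, "
      f"speedup {base_latency / latency_of(W.shape[1]):.2f}x")

Running the loop for several increasing speedup targets in one pass is what yields the set of progressively smaller models described above; interleaving knowledge distillation between pruning steps (omitted here) recovers accuracy in the gradual setting.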