Transformer-based models, represented by GPT-3, ChatGPT, and GPT-4, have recently attracted increasing interest, research enthusiasm, and business demand. However, their massive computational requirements and large memory footprint pose significant deployment challenges. To address this issue, we propose BCT, a framework for blockwise compression of transformers without retraining, which lowers the threshold for deployment. BCT achieves fine-grained compression of the entire transformer, covering embedding, matrix multiplication, GELU, Softmax, layer normalization, and all intermediate results. As a case study, we compress an efficient model with BCT and evaluate it on several General Language Understanding Evaluation (GLUE) datasets. The results show that BCT incurs an accuracy drop of less than 0.90% on most tasks.
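To make the idea of blockwise compression concrete, the sketch below shows per-block integer quantization of a weight tensor, where each block gets its own scale factor. This is only an illustration of the general technique the abstract describes; the block size of 64, the int8 bit-width, and the symmetric scheme are assumptions and not BCT's actual configuration.

```python
import numpy as np

def blockwise_quantize(x, block_size=64, num_bits=8):
    """Quantize a tensor block by block with a per-block scale.

    Illustrative sketch only: block size, bit-width, and the symmetric
    scheme are assumptions, not the exact BCT settings.
    """
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for int8
    flat = x.reshape(-1)
    pad = (-len(flat)) % block_size          # pad so length divides evenly
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)

    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                # avoid division by zero
    q = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, scales, x.shape, pad

def blockwise_dequantize(q, scales, shape, pad):
    """Recover a floating-point approximation of the original tensor."""
    flat = (q.astype(np.float32) * scales).reshape(-1)
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

# Usage: compress a weight matrix and check the reconstruction error.
w = np.random.randn(768, 768).astype(np.float32)
q, s, shape, pad = blockwise_quantize(w)
w_hat = blockwise_dequantize(q, s, shape, pad)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Because each block carries its own scale, outliers in one block do not degrade the quantization resolution of the others, which is the motivation for compressing at block granularity rather than per tensor.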