Despite the success of Transformers in self-supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for applying these models to a wide spectrum of devices. Several isolated attempts have been made to compress Transformers, prior to applying them to downstream tasks. In this work, we aim to provide context for the isolated results, studying several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation. We report wall-clock time, the number of parameters, and the number of multiply-accumulate operations for these techniques, charting the landscape of compressing Transformer-based self-supervised models.