Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow. Model compression techniques such as pruning aim to reduce model size and computation without degrading accuracy. Prior studies have focused on pruning Transformers; however, speech models not only use a stack of Transformer blocks, but also combine them with a frontend network of multiple convolutional layers for low-level feature representation learning. This frontend is small in size but heavy in computational cost. In this work, we propose three task-specific structured pruning methods to handle such heterogeneous networks. Experiments on LibriSpeech and SLURP show that the proposed method is more accurate than the original wav2vec2-base with 10% to 30% less computation, and can reduce computation by 40% to 50% without any degradation.
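To make the idea of structured pruning for such a heterogeneous network concrete, the sketch below attaches learnable gates to the two kinds of prunable groups mentioned above: output channels of a convolutional frontend layer and attention heads of a Transformer-style block, and adds an expected-group-count penalty to the task loss. This is a minimal, generic gate-based sketch, not the paper's three proposed methods; all module names (`GroupGate`, `GatedConvFrontend`, `TinyEncoder`), shapes, and the penalty weight are illustrative assumptions.

```python
# Minimal sketch of gate-based structured pruning for a heterogeneous speech
# encoder (conv frontend + attention block). Illustrative only: this is NOT
# the paper's method; names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn as nn


class GroupGate(nn.Module):
    """One learnable gate per prunable group (conv channel or attention head)."""
    def __init__(self, num_groups: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_groups))

    def forward(self) -> torch.Tensor:
        # Relaxed gates in (0, 1); groups whose gate collapses to ~0 can be
        # physically removed after training.
        return torch.sigmoid(self.logits)


class GatedConvFrontend(nn.Module):
    """A single conv layer whose output channels are masked by gates."""
    def __init__(self, in_ch=1, out_ch=64, kernel=10, stride=5):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride)
        self.gate = GroupGate(out_ch)

    def forward(self, wave):                       # wave: (batch, 1, samples)
        return self.conv(wave) * self.gate().view(1, -1, 1)


class GatedSelfAttention(nn.Module):
    """Minimal multi-head self-attention with one gate per head."""
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.gate = GroupGate(num_heads)

    def forward(self, x):                          # x: (batch, time, dim)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):                              # -> (batch, heads, time, d)
            return z.view(b, t, self.h, self.d).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        ctx = att @ v                              # per-head context vectors
        ctx = ctx * self.gate().view(1, -1, 1, 1)  # zero out pruned heads
        return self.out(ctx.transpose(1, 2).reshape(b, t, -1))


class TinyEncoder(nn.Module):
    """Toy stand-in for a conv-frontend + Transformer speech encoder."""
    def __init__(self):
        super().__init__()
        self.frontend = GatedConvFrontend()
        self.attn = GatedSelfAttention()

    def forward(self, wave):
        feats = self.frontend(wave).transpose(1, 2)  # (batch, frames, channels)
        return self.attn(feats)


def sparsity_penalty(model: nn.Module) -> torch.Tensor:
    """Expected number of open gates; added to the task loss so training
    trades accuracy against the number of surviving channels/heads."""
    total = torch.zeros(())
    for m in model.modules():
        if isinstance(m, GroupGate):
            total = total + m().sum()
    return total


# One pruning-aware training step on random data (the squared-output "loss"
# is a placeholder for a real ASR or SLU task loss).
model = TinyEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
wave = torch.randn(2, 1, 4000)
loss = model(wave).pow(2).mean() + 1e-3 * sparsity_penalty(model)
loss.backward()
opt.step()
```

After such training, groups whose gates are close to zero can be dropped and the remaining weights copied into a physically smaller model; making the pruning task-specific amounts to driving the gates with the downstream (e.g., ASR or SLU) loss rather than the pretraining objective.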