Squeeze and Efficient Wav2vec (SEW) is a recently proposed architecture that squeezes the input to the transformer encoder for compute-efficient pre-training and inference with wav2vec 2.0 (W2V2) models. In this work, we propose stochastic compression for on-demand compute reduction in W2V2 models. Instead of using a fixed squeeze factor, we sample it uniformly during training. We further introduce query and key-value pooling mechanisms that can be applied to each transformer layer for additional compression. Our results for models pre-trained on the 960h LibriSpeech dataset and fine-tuned on 10h of transcribed data show that, using the same stochastic model, we obtain a smooth trade-off between word error rate (WER) and inference time, with only marginal WER degradation compared to W2V2 and SEW models trained for a specific setting. We further show that we can fine-tune the same stochastically pre-trained model to a specific configuration to recover the WER difference, yielding significant computational savings over pre-training models from scratch.
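To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the stochastic squeezing step: a squeeze factor is sampled uniformly from a candidate set at each training step and the encoder input is average-pooled along the time axis, while a fixed factor can be supplied at inference to meet a target compute budget. The class name StochasticPool and the candidate factor set are assumptions made for illustration; an analogous pooling could be applied to queries and keys/values inside each transformer layer.

```python
# Hypothetical sketch; names and factor choices are illustrative, not from the paper's code.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticPool(nn.Module):
    """Average-pools the time axis by a squeeze factor sampled per training step."""
    def __init__(self, squeeze_factors=(1, 2, 3, 4)):
        super().__init__()
        self.squeeze_factors = squeeze_factors

    def forward(self, x, squeeze_factor=None):
        # x: (batch, time, channels) sequence entering the transformer encoder
        if squeeze_factor is None:
            # Training: sample the squeeze factor uniformly from the candidate set.
            squeeze_factor = random.choice(self.squeeze_factors)
        if squeeze_factor == 1:
            return x
        x = x.transpose(1, 2)                                   # (batch, channels, time)
        x = F.avg_pool1d(x, kernel_size=squeeze_factor,
                         stride=squeeze_factor, ceil_mode=True)  # shorten the time axis
        return x.transpose(1, 2)                                # (batch, time', channels)

# Usage: random factor during training, fixed factor at inference.
pool = StochasticPool()
x = torch.randn(2, 100, 768)
y_train = pool(x)
y_infer = pool(x, squeeze_factor=2)
```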