This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.