Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive, since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find that performance decreases without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations lie in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.
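The abstract states that the fixed wav2vec 2.0 representations lie in a low-dimensional subspace and that decorrelating them stabilizes downstream training, but does not spell out the procedure here. As an illustration only, the following is a minimal sketch of one standard way to decorrelate such features, PCA whitening with NumPy; the function name `decorrelate`, the epsilon guard, and the choice of whitening transform are assumptions, not the paper's confirmed method.

```python
import numpy as np

def decorrelate(features, eps=1e-5):
    """PCA-whiten a matrix of fixed representations (illustrative sketch).

    features: (num_frames, dim) array, e.g. frozen wav2vec 2.0 outputs.
    Returns the features projected onto decorrelated, unit-variance axes.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Sample covariance across the representation dimensions.
    cov = centered.T @ centered / max(len(centered) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate onto the eigenvectors and rescale each axis to unit variance.
    # eps guards against near-zero eigenvalues, which are expected if the
    # representations occupy a low-dimensional subspace.
    return centered @ eigvecs / np.sqrt(eigvals + eps)

# Usage: 1000 frames of 768-dimensional features (768 matches the
# wav2vec 2.0 BASE transformer width; the data here is random).
feats = np.random.randn(1000, 768).astype(np.float32)
white = decorrelate(feats)
print(np.allclose(np.cov(white.T), np.eye(768), atol=1e-2))  # ~identity
```

After this transform the feature dimensions are uncorrelated with roughly unit variance, which is the property the abstract associates with more stable training of the downstream recognizer.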