Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encodings using a contrastive loss in a way similar to wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features to the wav2vec 2.0 network from the quantized representations in a way similar to a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in an RNN-T ASR model and fine-tuned with 1k hours of labeled data. This work is one of only a few studies of self-supervised learning on speech tasks with a large volume of real far-field labeled data. The Wav2vec-C encoded representations achieve, on average, twice the error reduction over the baseline and a higher codebook utilization in comparison to wav2vec 2.0.
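A minimal sketch of the training objective described above: a wav2vec 2.0-style contrastive loss over masked, quantized speech representations, combined with a VQ-VAE-style consistency loss that reconstructs the encoder input features from the quantized codes. This is an illustrative PyTorch sketch, not the authors' implementation; the module interface, the negative-sampling scheme, and the weighting factor `gamma` are assumptions for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Wav2vecCLoss(nn.Module):
    """Illustrative combined objective: contrastive term (wav2vec 2.0-style)
    plus a consistency/reconstruction term (VQ-VAE-style) weighted by gamma."""

    def __init__(self, gamma: float = 1.0, temperature: float = 0.1):
        super().__init__()
        self.gamma = gamma              # assumed weight on the consistency term
        self.temperature = temperature  # softmax temperature for the contrastive term

    def contrastive_loss(self, context, quantized, mask, num_negatives: int = 10):
        # InfoNCE over masked frames: for each masked time step, identify the true
        # quantized target among distractors sampled uniformly from other frames.
        losses = []
        for b in range(context.size(0)):
            masked_idx = mask[b].nonzero(as_tuple=True)[0]
            for t in masked_idx:
                pos = quantized[b, t].unsqueeze(0)                       # (1, D) positive
                neg_t = torch.randint(0, quantized.size(1), (num_negatives,))
                negs = quantized[b, neg_t]                               # (K, D) distractors
                cands = torch.cat([pos, negs], dim=0)                    # (K+1, D)
                logits = F.cosine_similarity(
                    context[b, t].unsqueeze(0), cands) / self.temperature
                # the positive is always at index 0
                losses.append(F.cross_entropy(
                    logits.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
        return torch.stack(losses).mean()

    def forward(self, context, quantized, reconstructed, features, mask):
        # context:       contextualized outputs at masked positions, (B, T, D)
        # quantized:     quantized encoder representations, (B, T, D)
        # reconstructed: consistency-network reconstruction of the inputs, (B, T, F)
        # features:      original input features to the encoder, (B, T, F)
        # mask:          boolean mask of masked time steps, (B, T)
        l_contrastive = self.contrastive_loss(context, quantized, mask)
        # Consistency term: reconstructing the input features from the codes
        # regularizes the quantizer, as in a VQ-VAE decoder.
        l_consistency = F.mse_loss(reconstructed, features)
        return l_contrastive + self.gamma * l_consistency
```

In this sketch, raising `gamma` pushes the codebook to retain enough information to reconstruct the input features, which is the mechanism the abstract credits for the higher codebook utilization relative to wav2vec 2.0.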