We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular, the quantizer projects speech inputs with a randomly initialized matrix and performs a nearest-neighbor lookup in a randomly initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separate from the speech recognition model, the design makes the approach flexible and compatible with universal speech recognition architectures. On LibriSpeech our approach achieves word-error-rates similar to previous self-supervised learning work with non-streaming models, and provides lower word-error-rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks the approach also provides significant improvements over wav2vec 2.0 and w2v-BERT.
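The random-projection quantizer described above can be sketched in a few lines of NumPy: project each frame with a fixed random matrix, then take the index of the nearest codebook vector as the discrete label. The dimensions below (80-dim features, 16-dim projection, 8192 codes) are illustrative assumptions for the sketch, not values taken from this abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the abstract)
feat_dim, proj_dim, codebook_size = 80, 16, 8192

# Randomly initialized and never updated during self-supervised learning
projection = rng.normal(size=(feat_dim, proj_dim))
codebook = rng.normal(size=(codebook_size, proj_dim))

def quantize(features):
    """Map (T, feat_dim) speech features to (T,) discrete codebook indices."""
    projected = features @ projection  # (T, proj_dim)
    # Squared Euclidean distance from each projected frame to every code
    dists = ((projected[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # nearest-neighbor lookup

# Example: label 100 frames of (synthetic) features
labels = quantize(rng.normal(size=(100, feat_dim)))
```

Because the quantizer has no trainable parameters, the labels it produces are fixed targets, which is what decouples it from the choice of speech recognition architecture.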