Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose Code BERT (CoBERT), an approach for self-supervised speech representation learning. The idea is to convert an utterance into a sequence of discrete codes and perform code representation learning, in which we predict code representations based on a masked view of the original speech input. Unlike prior self-distillation approaches, in which the teacher and the student share the same modality, our target model predicts representations from a different modality. CoBERT outperforms the most recent state-of-the-art models on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task. Our code and models are released at https://github.com/mct10/CoBERT.
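The masked cross-modal prediction described above can be sketched minimally as follows. This is an illustrative toy in NumPy, not the paper's implementation: the code table, the linear student, and the L2 loss are all assumptions standing in for the actual teacher and Transformer student; in practice the discrete codes might come from, e.g., k-means clustering of pretrained speech features.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, V = 8, 4, 16  # frames, feature dim, code vocabulary size

speech = rng.normal(size=(T, D))    # frame-level speech features
codes = rng.integers(0, V, size=T)  # one discrete code per frame (assumed given)

# Hypothetical "code teacher": an embedding table mapping each discrete
# code to a target representation in the code modality.
code_embed = rng.normal(size=(V, D))
targets = code_embed[codes]         # (T, D) code representations

# Mask a span of the speech input, as in masked prediction.
mask = np.zeros(T, dtype=bool)
mask[2:5] = True
masked_speech = np.where(mask[:, None], 0.0, speech)

# Hypothetical student: a single linear map from masked speech to the
# code-representation space (a Transformer in the real model).
W = rng.normal(size=(D, D))
pred = masked_speech @ W

# Regression loss on masked positions only; the paper's exact objective
# may differ, L2 is used here purely for illustration.
loss = float(np.mean((pred[mask] - targets[mask]) ** 2))
print(loss)
```

The key point the sketch mirrors is that the prediction targets live in a different modality (code representations) than the student's input (speech), unlike same-modality self-distillation.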