Self-supervised learning (SSL) has shown tremendous success in various speech-related downstream tasks, including Automatic Speech Recognition (ASR). The output embeddings of the SSL model are treated as powerful short-time representations of the speech signal. However, in ASR the main objective is to predict the correct sequence of acoustic units, characters, or byte-pair encodings (BPEs). An encoder-decoder architecture typically works exceptionally well for a sequence-to-sequence task such as ASR. Therefore, in this paper we propose a new paradigm that exploits the power of a decoder during self-supervised learning. We use the Hidden Unit BERT (HuBERT) SSL framework to compute the conventional masked prediction loss for the encoder. In addition, we introduce a decoder into the SSL framework and propose a target preparation strategy for the decoder. Finally, we use a multitask SSL setup in which the encoder and decoder losses are jointly optimized. We hypothesize that the presence of a decoder in the SSL model helps it learn an acoustic unit-based language model, which might improve the performance of an ASR downstream task. We compare our proposed SSL model with HuBERT and show up to 25% relative improvement in performance on ASR by fine-tuning on various LibriSpeech subsets.
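A minimal sketch of the joint multitask objective implied above, assuming the encoder and decoder losses are combined with an interpolation weight $\lambda$ (the symbol and the linear weighting scheme are our assumption; the abstract does not specify how the losses are balanced):

\[
\mathcal{L}_{\text{SSL}} \;=\; \mathcal{L}_{\text{enc}} \;+\; \lambda\,\mathcal{L}_{\text{dec}},
\]

where $\mathcal{L}_{\text{enc}}$ denotes the conventional HuBERT masked prediction loss computed on the encoder outputs over masked frames, and $\mathcal{L}_{\text{dec}}$ denotes the decoder's sequence prediction loss against the targets produced by the proposed target preparation strategy.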