Recently, pioneering work has found that speech pre-trained models can solve full-stack speech processing tasks, because the model uses its bottom layers to learn speaker-related information and its top layers to encode content-related information. Since network capacity is limited, we believe speech recognition performance could be further improved if the model were dedicated to learning audio content information. To this end, we propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT, achieving a 23.5%/11.6% relative word error rate reduction for base/large models in the setting without a language model. Detailed analysis shows that the bottom layers of our model correlate better with phonetic units, which is consistent with our intuition and explains the success of our method for ASR.
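The core idea, adding a masked-prediction SSL loss on intermediate layers in addition to the usual top-layer loss, can be sketched as below. This is a minimal illustration, not the paper's implementation: the choice of supervised layers, the equal weighting of the terms, and the toy cross-entropy objective are all assumptions for demonstration.

```python
import numpy as np

def masked_prediction_loss(logits, targets, mask):
    """HuBERT-style masked-prediction loss: cross-entropy between a
    layer's frame-level logits and discrete targets, averaged over
    masked frames only.  (Toy stand-in for the real SSL objective.)"""
    # softmax over the target-unit dimension, numerically stabilized
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    nll = -np.log(probs[np.arange(len(targets)), targets] + 1e-9)
    return float(nll[mask].mean())

def ils_ssl_loss(layer_logits, targets, mask, intermediate_layers):
    """ILS-SSL objective sketch: top-layer SSL loss plus the same loss
    applied to selected intermediate layers (equal weights assumed)."""
    total = masked_prediction_loss(layer_logits[-1], targets, mask)
    for layer in intermediate_layers:
        total += masked_prediction_loss(layer_logits[layer], targets, mask)
    return total

# toy example: 3 layers of logits over 6 frames and 4 discrete units
rng = np.random.default_rng(0)
layer_logits = [rng.normal(size=(6, 4)) for _ in range(3)]
targets = np.array([0, 1, 2, 3, 0, 1])
mask = np.array([True, True, False, True, False, True])
loss = ils_ssl_loss(layer_logits, targets, mask, intermediate_layers=[0])
```

Supervising a bottom layer directly pushes content (unit-prediction) information downward in the stack, which is the mechanism the analysis in the abstract points to.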