Self-supervised learning (SSL) models have achieved considerable improvements in automatic speech recognition (ASR). In principle, ASR performance could be improved further if the model were dedicated to learning audio content information. To this end, we propose a progressive multi-scale self-supervised learning (PMS-SSL) method, which uses fine-grained target sets to compute the SSL loss at the top layer and coarse-grained target sets at the intermediate layers. Furthermore, PMS-SSL introduces a multi-scale structure into multi-head self-attention for better speech representation, restricting the attention area to a large scope at higher layers and to a small scope at lower layers. Experiments on the Librispeech dataset demonstrate the effectiveness of the proposed method: compared with HuBERT, PMS-SSL achieves 13.7% / 12.7% relative WER reductions on the test-other evaluation subset when fine-tuned on the 10-hour / 100-hour subsets, respectively.
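The layer-dependent attention restriction can be sketched as a band mask whose width grows with layer depth. This is a minimal illustration, not the paper's implementation: the linear window schedule and the function names (`attention_mask`, `layer_window`) are assumptions for exposition only.

```python
def attention_mask(seq_len, window):
    """Boolean band mask: frame i may attend to frame j iff |i - j| <= window.

    Lower layers would use a small window (local scope); higher layers a
    large one (global scope), matching the multi-scale attention idea.
    """
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]


def layer_window(layer, num_layers, min_win, max_win):
    """Illustrative schedule: grow the window linearly from the lowest
    layer (min_win) to the highest layer (max_win)."""
    frac = layer / max(num_layers - 1, 1)
    return round(min_win + frac * (max_win - min_win))
```

For example, with a 12-layer encoder, `layer_window(0, 12, 2, 64)` gives a narrow window of 2 frames at the bottom layer, while the top layer attends over 64 frames; the resulting mask would be applied inside each self-attention block.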