This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning. The design of MGF takes the hierarchical structure of speech into consideration. Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and discriminative learning approaches to distill coarse-grained, semantic information at large time scales. For phoneme-scale learning, we borrow the idea of the masked language model but tailor it to the continuous speech signal by replacing the classification loss with a contrastive loss. We corroborate our design by evaluating MGF representations on various downstream tasks, including phoneme classification, speaker classification, speech recognition, and emotion classification. Experiments verify that training at different time scales calls for different training targets and loss functions, which in general complement each other and lead to better performance.
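To make the phoneme-scale objective concrete, the following is a minimal sketch of masked prediction trained with a contrastive loss instead of a classification loss. It is an illustration under assumptions, not the paper's exact formulation: the InfoNCE-style form, the choice of other masked positions as negatives, and all names (`masked_contrastive_loss`, `temperature`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(predictions: torch.Tensor,
                            targets: torch.Tensor,
                            temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style contrastive loss for masked-frame prediction.

    predictions: (N, D) model outputs at the N masked positions.
    targets:     (N, D) latent features of the true frames at those positions.

    For the i-th masked position, the i-th target is the positive;
    targets at the other masked positions serve as negatives. This
    replaces the softmax over a discrete vocabulary used in masked
    language models, which has no analogue for continuous speech.
    """
    predictions = F.normalize(predictions, dim=-1)
    targets = F.normalize(targets, dim=-1)
    # (N, N) matrix of scaled cosine similarities between every
    # prediction and every candidate target.
    logits = predictions @ targets.t() / temperature
    # The i-th prediction should score highest against the i-th target.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

A classification loss would require a fixed inventory of discrete units; the contrastive form only needs the model to rank the true continuous frame above distractors, which is what makes it suitable for raw speech.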