Self-supervised learning (SSL) is a long-standing goal for speech processing, since it utilizes large-scale unlabeled data and avoids extensive human labeling. Recent years have witnessed great success in applying SSL to speech recognition, while only limited exploration has gone into applying SSL to model speaker characteristics. In this paper, we aim to improve the existing SSL framework for speaker representation learning. Two methods are introduced to enhance unsupervised speaker information extraction. First, we apply multi-task learning to the current SSL framework, integrating an utterance-wise contrastive loss with the SSL objective function. Second, for better speaker discrimination, we propose an utterance mixing strategy for data augmentation, where additional overlapped utterances are created in an unsupervised manner and incorporated during training. We integrate the proposed methods into the HuBERT framework. Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance in universal representation learning, especially for speaker identification-oriented tasks. An ablation study verifies the efficacy of each proposed method. Finally, we scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement on all SUPERB tasks.
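The abstract does not spell out the form of the combined multi-task objective; the following is a minimal sketch assuming a simple weighted sum, where the trade-off weight $\lambda$ and the loss names are illustrative rather than the paper's notation:

$$\mathcal{L} = \mathcal{L}_{\mathrm{SSL}} + \lambda \, \mathcal{L}_{\mathrm{contrastive}}$$

Here $\mathcal{L}_{\mathrm{SSL}}$ denotes the masked-prediction loss of the underlying HuBERT model, and $\mathcal{L}_{\mathrm{contrastive}}$ denotes the utterance-wise contrastive loss, which under one natural reading treats representations drawn from the same utterance as positive pairs (same speaker) and representations from other utterances as negatives.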
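The mixing procedure itself is not detailed in the abstract; below is a minimal NumPy sketch of one plausible implementation of utterance mixing, in which the function name mix_utterances, the max_mix_ratio parameter, and the assumed SNR range are all hypothetical illustrations rather than the paper's actual algorithm:

```python
import numpy as np

def mix_utterances(primary, secondary, max_mix_ratio=0.5, rng=None):
    """Overlay a random chunk of `secondary` onto a random region of
    `primary`, simulating an overlapped (multi-speaker) utterance.

    Hypothetical sketch: chunk length, region selection, and energy
    scaling are assumptions, not the paper's specified procedure.
    """
    rng = rng or np.random.default_rng()
    # The mixed region covers at most `max_mix_ratio` of the primary
    # utterance, so the primary speaker remains dominant.
    mix_len = int(rng.uniform(0.0, max_mix_ratio) * len(primary))
    mix_len = min(mix_len, len(secondary))
    if mix_len == 0:
        return primary.copy()
    src_start = rng.integers(0, len(secondary) - mix_len + 1)
    dst_start = rng.integers(0, len(primary) - mix_len + 1)
    chunk = secondary[src_start:src_start + mix_len]
    # Scale the interfering chunk to a random signal-to-interference
    # ratio (assumed 0-10 dB range).
    energy_p = np.sqrt(np.mean(primary ** 2)) + 1e-8
    energy_s = np.sqrt(np.mean(chunk ** 2)) + 1e-8
    snr_db = rng.uniform(0.0, 10.0)
    scale = energy_p / (energy_s * 10 ** (snr_db / 20))
    mixed = primary.copy()
    mixed[dst_start:dst_start + mix_len] += scale * chunk
    return mixed
```

A training batch could then pair each utterance with a randomly chosen other utterance from the same batch and feed the mixed waveform to the SSL model, while the prediction targets remain those of the primary utterance, so no speaker labels are required.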