Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to more acoustic/linguistic domains, languages, and modalities, these investigations have been limited to single-source speech with one primary speaker per recording. This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective. This objective encourages the model to identify the number of sources, separate and understand the context, and infer the content of masked regions, represented as discovered units. Cocktail HuBERT outperforms the state of the art, with 69% lower WER on multi-speaker ASR and 31% lower DER on diarization, and is competitive on single- and multi-speaker tasks from SUPERB.
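As a rough illustration of the kind of objective described above (not the paper's actual implementation), the sketch below shows a masked, permutation-invariant unit-prediction loss over pseudo-mixed sources in PyTorch. The function name `masked_pit_unit_loss`, the tensor shapes, the exhaustive permutation search, and the toy masking rate are all assumptions made for illustration.

```python
import itertools

import torch
import torch.nn.functional as F


def masked_pit_unit_loss(logits, targets, mask):
    """Permutation-invariant masked unit-prediction loss (illustrative sketch).

    logits:  (S, T, V) predictions for S source streams over T frames,
             where V is the discovered-unit vocabulary size
    targets: (S, T)    k-means-style unit ids for each pseudo source
    mask:    (T,) bool True at frames whose input features were masked
    """
    S, T, V = logits.shape
    idx = mask.nonzero(as_tuple=True)[0]  # indices of masked frames
    # Cross-entropy between every predicted stream i and every target source j,
    # computed only over masked frames.
    pair_ce = torch.stack([
        torch.stack([
            F.cross_entropy(logits[i, idx], targets[j, idx])
            for j in range(S)
        ])
        for i in range(S)
    ])  # shape (S, S)
    # Permutation-invariant training: take the stream-to-source assignment
    # with the lowest total loss (feasible here since S is small).
    best = min(
        sum(pair_ce[i, p[i]] for i in range(S))
        for p in itertools.permutations(range(S))
    )
    return best / S


# Toy usage: two pseudo sources, 100 frames, a 504-unit vocabulary.
S, T, V = 2, 100, 504
logits = torch.randn(S, T, V, requires_grad=True)
targets = torch.randint(0, V, (S, T))
mask = torch.rand(T) < 0.08  # randomly mask a fraction of frames
loss = masked_pit_unit_loss(logits, targets, mask)
loss.backward()
```

In the framework the abstract describes, the targets would come from units discovered on the unmixed utterances and the model would be trained to predict them from an artificially mixed, partially masked input; the random tensors above merely stand in for those quantities.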