Dominant research on speaker extraction adopts supervised training, while the scarcity of ideally clean corpora and the channel mismatch problem are rarely considered. To this end, we propose speaker-aware mixture of mixtures training (SAMoM), which exploits the consistency of speaker identity among the target source, the enrollment utterance and the target estimate to weakly supervise the training of a deep speaker extractor. In SAMoM, the input is constructed by mixing different speaker-aware mixtures (SAMs), each containing multiple speakers whose identities are known and whose enrollment utterances are available. Informed by the enrollment utterances, the target speech signals are extracted from the input one by one, so that the estimated targets can approximate the original SAMs after a remix, in accordance with the identity consistency. Moreover, using SAMoM in a semi-supervised setting with a certain amount of clean sources enables application in noisy scenarios. Extensive experiments on Libri2Mix show that the proposed method achieves promising results without access to any clean sources (11.06 dB SI-SDRi). With domain adaptation, our approach even outperforms the supervised framework in a cross-domain evaluation on AISHELL-1.
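The core SAMoM step described above — mix several SAMs into one input, extract each speaker conditioned on its enrollment, then remix the estimates per original SAM and penalize the reconstruction mismatch — can be sketched as follows. This is a minimal NumPy illustration under assumptions of ours: `samom_remix_loss`, the `groups` mapping, and the MSE objective are placeholders for the paper's actual extractor network and SI-SDR-based training loss.

```python
import numpy as np

def samom_remix_loss(extractor, sams, enrollments, groups):
    """Weakly supervised SAMoM remix-consistency loss (illustrative sketch).

    extractor   : callable(mixture, enrollment) -> single-speaker estimate;
                  stands in for a deep extractor informed by the enrollment.
    sams        : list of speaker-aware mixtures (1-D arrays); their sum
                  forms the mixture-of-mixtures training input.
    enrollments : one enrollment cue per target speaker in the input.
    groups      : groups[i] = index of the SAM that speaker i came from.
    """
    mom = np.sum(sams, axis=0)  # mixture of mixtures (the network input)
    # Extract every target from the same input, one enrollment at a time.
    ests = [extractor(mom, e) for e in enrollments]
    loss = 0.0
    for j, sam in enumerate(sams):
        # Identity consistency: remixing the estimates whose speakers
        # belong to SAM j should reconstruct SAM j itself.
        remix = sum(ests[i] for i in range(len(ests)) if groups[i] == j)
        loss += np.mean((remix - sam) ** 2)  # MSE stands in for SI-SDR
    return loss / len(sams)
```

With an oracle extractor that returns each true source, the remixed estimates reproduce the original SAMs exactly and the loss vanishes; a trained extractor is pushed toward the same behavior without ever seeing clean targets in isolation.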