We study permutation invariant training (PIT), which targets the permutation ambiguity problem for speaker-independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame-level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss that scales to waveform models. Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem agnostic speech features", to reduce the local permutation errors made by utterance-level PIT (uPIT). Our results show that the proposed extensions help reduce permutation ambiguity. However, we also note that the studied STFT-based models are more effective at reducing permutation errors than waveform-based models, a perspective overlooked in recent studies.
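To make the distinction between the two PIT variants concrete, the following is a minimal sketch (not the paper's implementation) of both losses for the general multi-speaker case, using plain NumPy and a mean-squared-error criterion. The function name `pit_losses` and the array shapes are assumptions introduced here for illustration: `est` and `ref` hold per-frame features of the estimated and reference sources. uPIT picks one permutation for the whole utterance, while tPIT picks the best permutation independently at every frame, which is what makes frame-level local permutation errors impossible by construction.

```python
import numpy as np
from itertools import permutations

def pit_losses(est, ref):
    """Hypothetical illustration of uPIT vs. tPIT losses.

    est, ref: arrays of shape (n_src, n_frames, n_feat) holding
    per-frame features of estimated and reference sources.
    Returns (upit_loss, tpit_loss) under mean-squared error.
    """
    n_src = est.shape[0]
    # Per-frame error for every (estimate, reference) pairing:
    # pair_err[i, j, t] = MSE between estimate i and reference j at frame t.
    pair_err = np.array([[np.mean((est[i] - ref[j]) ** 2, axis=-1)
                          for j in range(n_src)] for i in range(n_src)])
    perms = list(permutations(range(n_src)))
    # uPIT: a single permutation is chosen for the whole utterance,
    # namely the one minimizing the error averaged over all frames.
    utt_err = [np.mean([pair_err[i, p[i]] for i in range(n_src)])
               for p in perms]
    upit_loss = min(utt_err)
    # tPIT: the best permutation is chosen independently per frame.
    frame_err = np.stack([np.mean([pair_err[i, p[i]] for i in range(n_src)],
                                  axis=0)
                          for p in perms])  # shape: (n_perms, n_frames)
    tpit_loss = np.mean(frame_err.min(axis=0))
    return upit_loss, tpit_loss
```

Note that `tpit_loss` lower-bounds `upit_loss` by construction; the paper's two-stage approach then needs a second tracking (clustering) stage to resolve the frame-level permutations that tPIT leaves undetermined.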