Single-channel speech separation has seen great progress in recent years. However, training neural speech separation models for a large number of speakers (e.g., more than 10) is out of reach for current methods, which rely on the Permutation Invariant Training (PIT) loss. In this work, we present a permutation-invariant training scheme that employs the Hungarian algorithm, achieving $O(C^3)$ time complexity, where $C$ is the number of speakers, compared to the $O(C!)$ of PIT-based methods. Furthermore, we present a modified architecture that can handle the increased number of speakers. Our approach separates up to $20$ speakers and improves previous results for large $C$ by a wide margin.
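The core idea can be illustrated with a minimal sketch, not the paper's actual implementation: given a matrix of pairwise losses between each estimated source and each reference speaker, the Hungarian algorithm (here via SciPy's `linear_sum_assignment`) finds the optimal estimate-to-speaker permutation in $O(C^3)$, whereas naive PIT would enumerate all $C!$ permutations. The function name and toy loss matrix below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_pit_assignment(pairwise_losses):
    """Find the best estimate-to-speaker permutation in O(C^3).

    pairwise_losses: (C, C) array where entry [i, j] is the loss of
    matching estimated source i against reference speaker j
    (e.g., negative SI-SNR).
    Returns the optimal permutation and its total loss.
    """
    row_idx, col_idx = linear_sum_assignment(pairwise_losses)
    return col_idx, pairwise_losses[row_idx, col_idx].sum()

# Toy example with C = 3 speakers: naive PIT would evaluate
# 3! = 6 permutations; the Hungarian algorithm solves the
# assignment directly from the loss matrix.
losses = np.array([[0.1, 0.9, 0.8],
                   [0.7, 0.2, 0.9],
                   [0.8, 0.7, 0.3]])
perm, total = hungarian_pit_assignment(losses)
# perm -> [0, 1, 2], total -> 0.6
```

For $C = 20$, this replaces $20! \approx 2.4 \times 10^{18}$ permutation evaluations with a single cubic-time assignment, which is what makes training at that scale feasible.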