Deep learning has shown great potential for speech separation, especially for separating speech from non-speech interference. However, it encounters the permutation problem in multi-speaker separation, where both the target and the interference are speech. Permutation invariant training (PIT) was proposed to solve this problem by permuting the order of the multiple speakers. Another approach is to use an anchor speech, a short utterance from the target speaker, to model the speaker identity. In this paper, we propose a simple strategy for training a long short-term memory (LSTM) model that solves the permutation problem in speaker separation. Specifically, we insert a short utterance of the target speaker at the beginning of a mixture as guide information, so the first-appearing speaker is defined as the target. Owing to its powerful sequence-modeling capability, the LSTM can use its memory cells to track and separate the target speech from the interfering speech. Experimental results show that the proposed training strategy is effective for speaker separation.
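For illustration, a minimal sketch of the anchor-prepending strategy in PyTorch follows. The architecture, feature dimensions, layer sizes, and masking objective here are assumptions for clarity, not the paper's exact configuration; only the core idea (concatenating the anchor utterance before the mixture along time so the LSTM's memory tracks the first-appearing speaker) comes from the abstract.

```python
import torch
import torch.nn as nn

class AnchorGuidedSeparator(nn.Module):
    """Sketch of an LSTM separator where the anchor utterance is
    prepended to the mixture along time, so the first speaker the
    network hears is defined as the target."""

    def __init__(self, n_freq=129, hidden=256, layers=2):
        super().__init__()
        # hypothetical sizes; the paper does not specify these here
        self.lstm = nn.LSTM(n_freq, hidden, layers, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, anchor, mixture):
        # anchor:  (B, T_a, F) magnitude spectrogram of the target speaker
        # mixture: (B, T_m, F) magnitude spectrogram of the mixed speech
        x = torch.cat([anchor, mixture], dim=1)  # guide information goes first
        h, _ = self.lstm(x)                      # memory cells track the target
        m = torch.sigmoid(self.mask(h))          # time-frequency mask estimate
        # drop the frames that correspond to the anchor, keep the mixture part
        return m[:, anchor.size(1):, :] * mixture
```

Under this sketch, training would minimize a reconstruction loss (e.g. MSE) between the masked mixture and the clean target spectrogram, so that no output-permutation ambiguity arises: the target slot is fixed by the anchor.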