We present an upper bound for the single-channel speech separation task, based on an assumption regarding the nature of short segments of speech. Using this bound, we show that while recent methods have made significant progress for a few speakers, there is still room for improvement for five and ten speakers. We then introduce a deep neural network, SepIt, that iteratively improves the estimation of the different speakers. At test time, SepIt uses a varying number of iterations per test sample, based on a mutual information criterion that arises from our analysis. In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
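The per-sample stopping behavior described above can be sketched as a generic refine-until-no-gain loop. This is a minimal illustration under stated assumptions: `refine` (one iteration of the separation network) and `mi_score` (the mutual-information-based criterion) are hypothetical stand-ins, and the concrete stopping rule here is not the paper's exact procedure.

```python
def sepit_style_loop(mixture, refine, mi_score, max_iters=10):
    """Iteratively refine speaker estimates, stopping per sample.

    `refine(mixture, estimates)` returns improved estimates (None means
    "produce the initial separation"); `mi_score(mixture, estimates)`
    scores them. Both are illustrative assumptions, not the paper's
    actual network or criterion. Iteration stops as soon as the score
    no longer improves, so the iteration count varies per test sample.
    """
    estimates = refine(mixture, None)        # initial separation
    best_score = mi_score(mixture, estimates)
    for _ in range(max_iters - 1):
        candidate = refine(mixture, estimates)
        score = mi_score(mixture, candidate)
        if score <= best_score:              # no improvement: stop early
            break
        estimates, best_score = candidate, score
    return estimates
```

The design point the abstract makes is that the number of refinement steps is not fixed in advance; easy mixtures exit the loop early, hard ones use more iterations up to the cap.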