We investigate the effectiveness of convolutive prediction, a novel formulation of linear prediction for speech dereverberation, for speaker separation in reverberant conditions. The key idea is to first use a deep neural network (DNN) to estimate the direct-path signal of each speaker, and then identify delayed and decayed copies of the estimated direct-path signal. Such copies are likely due to reverberation, and can be directly removed for dereverberation or used as extra features for another DNN to perform better dereverberation and separation. To identify such copies, we solve a linear regression problem per frequency efficiently in the time-frequency (T-F) domain to estimate the underlying room impulse response (RIR). In the multi-channel extension, we perform minimum variance distortionless response (MVDR) beamforming on the outputs of convolutive prediction. The beamforming and dereverberation results are used as extra features for a second DNN to perform better separation and dereverberation. State-of-the-art results are obtained on the SMS-WSJ corpus.
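The per-frequency regression described above can be illustrated with a minimal single-channel sketch. It is not the authors' implementation: the function name `convolutive_prediction`, the parameters `taps` and `delay`, and the use of an unweighted least-squares fit are all assumptions made for illustration. Given the STFT of the reverberant signal `Y` and a DNN estimate of the direct-path signal `S_hat` (both hypothetical arrays of shape frames by frequencies), it fits a short filter at each frequency that maps delayed copies of `S_hat` onto the residual `Y - S_hat`, then subtracts the predicted reverberation.

```python
import numpy as np

def convolutive_prediction(Y, S_hat, taps=10, delay=1):
    """Illustrative sketch (assumed formulation, unweighted least squares).

    Y, S_hat: complex STFTs of shape (T, F), i.e. frames by frequencies.
    Returns an estimate of the dereverberated STFT with the same shape.
    """
    T, F = Y.shape
    reverb = np.zeros_like(Y)
    for f in range(F):
        # Column k holds the direct-path estimate delayed by (delay + k) frames.
        A = np.zeros((T, taps), dtype=Y.dtype)
        for k in range(taps):
            shift = delay + k
            if shift < T:
                A[shift:, k] = S_hat[:T - shift, f]
        # Whatever the direct-path estimate does not explain is treated as reverberation.
        target = Y[:, f] - S_hat[:, f]
        # Linear regression at this frequency: filter taps that best predict the residual.
        g, *_ = np.linalg.lstsq(A, target, rcond=None)
        reverb[:, f] = A @ g
    # Remove the predicted delayed and decayed copies from the observed signal.
    return Y - reverb
```

In this sketch the fitted taps `g` play the role of a per-frequency estimate of the late part of the RIR. The output could either be used directly as the dereverberated signal or stacked with `Y` as an extra input feature for a second network, as the abstract describes.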