We propose TF-GridNet, a novel multi-path deep neural network (DNN) operating in the time-frequency (T-F) domain, for monaural talker-independent speaker separation in anechoic conditions. The model stacks several multi-path blocks, each consisting of an intra-frame spectral module, a sub-band temporal module, and a full-band self-attention module, to leverage local and global spectro-temporal information for separation. The model is trained to perform complex spectral mapping, where the real and imaginary (RI) components of the input mixture are stacked as input features to predict the target RI components. Besides using the scale-invariant signal-to-distortion ratio (SI-SDR) loss for model training, we include a novel loss term to encourage the separated sources to add up to the input mixture. Without using dynamic mixing, we obtain 23.4 dB SI-SDR improvement (SI-SDRi) on the WSJ0-2mix dataset, outperforming the previous best by a large margin.
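The training objective described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the form of the mixture-related loss term, so a time-domain mean absolute error between the summed estimates and the mixture is assumed here. The function names `si_sdr`, `mixture_consistency_penalty`, and `total_loss`, the `weight` parameter, and the zero-mean normalization are all hypothetical choices, and the permutation search over speakers that talker-independent training normally requires is omitted for brevity.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length."""
    # Zero-mean normalization is a common convention, assumed here.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimally scaled reference.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    residual = estimate - projection
    return 10.0 * np.log10(
        (np.sum(projection ** 2) + eps) / (np.sum(residual ** 2) + eps)
    )


def mixture_consistency_penalty(estimates, mixture):
    """Assumed form of the extra loss term: mean absolute difference
    between the sum of the estimated sources and the input mixture."""
    return np.mean(np.abs(estimates.sum(axis=0) - mixture))


def total_loss(estimates, targets, mixture, weight=1.0):
    """Negative mean SI-SDR plus the weighted mixture penalty.

    estimates, targets: arrays of shape (num_speakers, num_samples),
    assumed to be already aligned speaker by speaker.
    """
    neg_si_sdr = -np.mean([si_sdr(e, t) for e, t in zip(estimates, targets)])
    return neg_si_sdr + weight * mixture_consistency_penalty(estimates, mixture)
```

Because SI-SDR ignores the scale of each estimate, a term of this kind is one way to tie the estimated sources back to the level of the observed mixture; how the two terms are actually weighted is not stated in the abstract.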