Recent studies on time-domain audio separation networks (TasNets) have made great strides in speech separation. One of the most representative TasNets is a network with a dual-path segmentation approach. However, the original model, DPRNN, uses a fixed feature dimension and an unchanged segment size throughout all layers of the network. In this paper, we propose a multi-scale feature fusion transformer network (MSFFT-Net) based on the conventional dual-path structure for single-channel speech separation. The conventional dual-path structure has only one processing path, which stacks several iterative blocks that alternate intra-chunk and inter-chunk operations to capture local and global context information. In contrast, the proposed MSFFT-Net has multiple parallel processing paths, and feature information can be exchanged among them. Experiments show that our proposed networks based on the multi-scale feature fusion structure achieve better results than the original dual-path model on the benchmark dataset WSJ0-2mix: the SI-SNRi score of MSFFT-3P is 20.7 dB (1.47% improvement) and that of MSFFT-2P is 21.0 dB (3.45% improvement), which achieves SOTA on WSJ0-2mix without any data augmentation.
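
For readers unfamiliar with the SI-SNRi metric quoted above, the following is a minimal sketch of how scale-invariant SNR improvement is commonly computed, using only the Python standard library. The function names (`si_snr`, `si_snr_improvement`), the `eps` stabilizer, and the toy sinusoid signals are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import math


def _zero_mean(x):
    """Return a copy of x with its mean removed."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals.

    The estimate is projected onto the reference; whatever is left
    over after the projection is treated as noise.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]              # part of est aligned with ref
    noise = [e - t for e, t in zip(est, target)]   # residual error
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy example (hypothetical data): two orthogonal sinusoids stand in for two speakers.
N = 800
reference = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
interferer = [math.sin(2 * math.pi * 13 * n / N) for n in range(N)]
mixture = [r + i for r, i in zip(reference, interferer)]
# Pretend a separator attenuated the interfering speaker by a factor of 10.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, reference):.2f} dB")
```

In this toy setup the interferer is attenuated tenfold in amplitude, which corresponds to an improvement of about 20 dB. Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the sense in which the metric is scale-invariant.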