Speaker-independent speech separation has achieved remarkable performance in recent years with the development of deep neural networks (DNNs). Various network architectures, from traditional convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to the more recent Transformer, have been carefully designed to improve separation performance. However, state-of-the-art models often suffer from computational drawbacks such as large model size, high memory consumption, and high computational complexity. To balance performance against computational efficiency, and to further explore the modeling capacity of traditional network structures, we combine RNNs with a recently proposed variant of the convolutional network to address the speech separation problem. By embedding two RNNs into the basic block of this variant via a dual-path strategy, the proposed network can effectively learn both local information and global dependencies. In addition, a four-stage structure allows separation to be performed gradually at finer and finer scales as the feature dimension increases. Experimental results on various datasets demonstrate the effectiveness of the proposed method and show that a good trade-off between separation performance and computational efficiency is achieved.
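The dual-path strategy mentioned above can be illustrated with a minimal sketch: the feature sequence is segmented into overlapping chunks, an intra-chunk RNN models local information within each chunk, and an inter-chunk RNN models global dependencies across chunks. The chunk size, hop, hidden size, and the plain tanh RNN below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def segment(x, chunk, hop):
    # Split a (T, F) feature sequence into overlapping chunks -> (N, chunk, F).
    T, F = x.shape
    pad = (-(T - chunk)) % hop if T > chunk else chunk - T
    x = np.pad(x, ((0, pad), (0, 0)))
    n = (x.shape[0] - chunk) // hop + 1
    return np.stack([x[i * hop:i * hop + chunk] for i in range(n)])

def simple_rnn(seq, Wx, Wh):
    # Vanilla tanh RNN over axis 0: (L, F) -> (L, H). Stands in for any RNN cell.
    h = np.zeros(Wh.shape[0])
    out = []
    for t in range(seq.shape[0]):
        h = np.tanh(seq[t] @ Wx + h @ Wh)
        out.append(h)
    return np.stack(out)

def dual_path_block(x, chunk=16, hop=8, hidden=8, rng=None):
    # One dual-path block: intra-chunk RNN (local), then inter-chunk RNN (global).
    rng = np.random.default_rng(0) if rng is None else rng
    c = segment(x, chunk, hop)                       # (N, chunk, F)
    N, L, F = c.shape
    Wx1 = 0.1 * rng.standard_normal((F, hidden))
    Wh1 = 0.1 * rng.standard_normal((hidden, hidden))
    # Intra-chunk pass: run the RNN along the time axis inside each chunk.
    intra = np.stack([simple_rnn(c[i], Wx1, Wh1) for i in range(N)])       # (N, L, H)
    Wx2 = 0.1 * rng.standard_normal((hidden, hidden))
    Wh2 = 0.1 * rng.standard_normal((hidden, hidden))
    # Inter-chunk pass: run the RNN across chunks at each within-chunk position.
    inter = np.stack([simple_rnn(intra[:, t], Wx2, Wh2) for t in range(L)], axis=1)
    return inter                                     # (N, L, H)
```

For a 40-frame, 4-dimensional feature sequence, `dual_path_block(x)` yields 4 chunks of length 16 with 8 hidden channels; stacking several such blocks (and increasing the feature dimension stage by stage) is the general pattern the abstract describes.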