The Transformer has shown strong performance in speech separation, owing to its ability to capture global features. However, capturing the local features and channel information of audio sequences is equally important for speech separation. In this paper, we present a novel approach to speech separation named Intra-SE-Conformer and Inter-Transformer (ISCIT). Specifically, we design a new network, SE-Conformer, that models audio sequences across multiple dimensions and scales, and we apply it within the dual-path speech separation framework. Furthermore, we propose Multi-Block Feature Aggregation, which improves separation by selectively using information from the intermediate blocks of the separation network. We also propose a speaker similarity discriminative loss that optimizes the separation model to address poor performance when speakers have similar voices. Experimental results on the benchmark datasets WSJ0-2mix and WHAM! show that ISCIT achieves state-of-the-art results.
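To make "channel information" concrete, the sketch below shows a generic squeeze-and-excitation style channel gate. It assumes that the "SE" in SE-Conformer refers to squeeze-and-excitation, which the abstract does not spell out; the function name se_gate, the weight shapes, and the toy inputs are hypothetical and are not taken from the ISCIT model.

```python
import math

def se_gate(features, w_reduce, w_expand):
    """Generic squeeze-and-excitation gate (illustrative, not the ISCIT code).

    features: list of C channels, each a list of T samples.
    w_reduce: H x C weights of the bottleneck layer (H < C in practice).
    w_expand: C x H weights mapping the bottleneck back to C channels.
    Returns the rescaled features and the per-channel gates.
    """
    # Squeeze: summarize each channel by its mean over time.
    squeezed = [sum(channel) / len(channel) for channel in features]

    # Excitation, step 1: bottleneck projection followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w_reduce
    ]

    # Excitation, step 2: project back to C channels and apply a sigmoid,
    # giving one gate in (0, 1) per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]

    # Scale: reweight every sample of a channel by that channel's gate.
    scaled = [[g * x for x in channel] for g, channel in zip(gates, features)]
    return scaled, gates

# Toy input: 2 channels, 2 time steps, bottleneck of size 1 (all hypothetical).
toy_features = [[1.0, 3.0], [0.0, 0.0]]
toy_scaled, toy_gates = se_gate(toy_features, [[1.0, 0.0]], [[1.0], [-1.0]])
```

In this toy call the first channel receives a gate above 0.5 and the second a gate below 0.5, so the gate emphasizes one channel relative to the other while leaving the time axis untouched. How the actual SE-Conformer combines such gating with convolution and attention is not described in the abstract.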