This work proposes a multichannel speech separation method based on a narrow-band Conformer (NBC). The network is trained to automatically exploit narrow-band speech separation cues, such as the clustering of the spatial vectors of multiple speakers. Specifically, in the short-time Fourier transform (STFT) domain, the network processes each frequency independently, and the same network is shared by all frequencies. For a given frequency, the network takes the STFT coefficients of the multichannel mixture signals as input and predicts the STFT coefficients of the separated speech signals. Clustering spatial vectors follows a principle similar to that of the self-attention mechanism: both compute the similarity between vectors and then aggregate similar vectors. Conformer would therefore be especially suitable for the present problem. Experiments show that the proposed narrow-band Conformer outperforms other state-of-the-art speech separation methods by a large margin.
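The narrow-band processing scheme described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: a single self-attention layer with random weights stands in for the full Conformer (which also contains convolution and feed-forward modules), and the names `narrow_band_separate`, `init_params`, and the dimension `dim` are hypothetical. The sketch only shows how the frequency axis can be treated as a batch axis so that one set of weights serves every frequency, and how attention along time aggregates frames with similar feature vectors.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(n_channels, n_speakers, dim=16, seed=0):
    # Random weights for illustration only; a real model would be trained.
    rng = np.random.default_rng(seed)

    def w(rows, cols):
        return rng.standard_normal((rows, cols)) / np.sqrt(rows)

    return {
        "w_in": w(2 * n_channels, dim),   # real/imag of all channels -> hidden
        "w_q": w(dim, dim),
        "w_k": w(dim, dim),
        "w_v": w(dim, dim),
        "w_out": w(dim, 2 * n_speakers),  # hidden -> real/imag of all speakers
    }


def narrow_band_separate(X, params, n_speakers):
    """Map a mixture STFT X of shape (F, T, C) to estimates of shape (F, T, S).

    The frequency axis F acts as a batch axis: the same weights are applied
    to every frequency, and no information is exchanged across frequencies.
    """
    # One real-valued feature vector per (frequency, frame).
    feats = np.concatenate([X.real, X.imag], axis=-1)          # (F, T, 2C)
    h = feats @ params["w_in"]                                  # (F, T, D)

    # Self-attention along time within each frequency: compare frames,
    # then aggregate the frames that are similar.
    q, k, v = h @ params["w_q"], h @ params["w_k"], h @ params["w_v"]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (F, T, T)
    h = h + softmax(scores, axis=-1) @ v                        # (F, T, D)

    out = h @ params["w_out"]                                   # (F, T, 2S)
    return out[..., :n_speakers] + 1j * out[..., n_speakers:]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    F, T, C, S = 129, 50, 8, 2  # assumed sizes, chosen for the demo
    X = rng.standard_normal((F, T, C)) + 1j * rng.standard_normal((F, T, C))
    Y = narrow_band_separate(X, init_params(C, S), S)
    print(Y.shape)
```

Because each frequency is processed independently, running the function on a single frequency slice `X[f:f+1]` gives the same result as slicing the full output at `f`, up to floating-point rounding.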