Recently, frequency-domain all-neural beamforming methods have made remarkable progress in multichannel speech separation. In parallel, the integration of time-domain network structures with beamforming has also gained significant attention. This study proposes a novel all-neural beamforming method in the time domain and attempts to unify the all-neural beamforming pipelines for time-domain and frequency-domain multichannel speech separation. The proposed model consists of two modules: separation and beamforming. Both modules perform temporal-spectral-spatial modeling and are trained end-to-end with a joint loss function. The novelty of this study is twofold. First, a time-domain directional feature conditioned on the direction of the target speaker is proposed, which can be jointly optimized within the time-domain architecture to enhance target signal estimation. Second, an all-neural beamforming network in the time domain is designed to refine the pre-separated results. This module estimates parametric, time-variant beamforming coefficients without explicitly following the derivation of optimal filters, which may impose an upper bound on performance. The proposed method is evaluated on simulated reverberant overlapped speech data derived from the AISHELL-1 corpus. Experimental results demonstrate significant performance improvements over frequency-domain state-of-the-art methods, ideal magnitude masks, and existing time-domain neural beamforming methods.
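To make the notion of time-variant beamforming coefficients in the time domain concrete, the following is a minimal filter-and-sum sketch in NumPy. It is not the proposed network: in the method described above the coefficients are estimated by a neural module, whereas here they are simply passed in as an array. The function name `time_variant_filter_and_sum`, the array shapes, and the choice of one causal FIR filter per channel and per frame are all assumptions made for illustration.

```python
import numpy as np


def time_variant_filter_and_sum(x, w, frame_len):
    """Apply per-frame, per-channel FIR filters and sum over channels.

    x: (C, N) array of microphone signals.
    w: (C, T, K) array of FIR taps, one K-tap filter per channel and frame
       (hypothetical layout; a network would normally predict these).
    frame_len: number of output samples governed by each frame's filters.
    Returns an (N,) array with the beamformed signal.
    """
    num_ch, num_samples = x.shape
    _, num_frames, num_taps = w.shape
    assert num_frames * frame_len >= num_samples, "not enough frames"

    # Left-pad with K-1 zeros so every output sample sees K past samples.
    x_pad = np.pad(x, ((0, 0), (num_taps - 1, 0)))
    y = np.zeros(num_samples)

    for t in range(num_frames):
        start = t * frame_len
        if start >= num_samples:
            break
        stop = min(start + frame_len, num_samples)
        for c in range(num_ch):
            # Segment covers the frame plus the K-1 samples preceding it.
            seg = x_pad[c, start:stop + num_taps - 1]
            # 'valid' convolution gives y[n] += sum_k w[k] * x[n - k].
            y[start:stop] += np.convolve(seg, w[c, t], mode="valid")
    return y


# Usage: two channels, three frames of four samples, four taps per filter.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 10))
w = np.zeros((2, 3, 4))
w[:, :, 0] = 0.5  # a zero-delay tap of 0.5 on each channel averages them
y = time_variant_filter_and_sum(x, w, frame_len=4)
```

With these hand-set taps the output reduces to the plain channel average. Because the taps are indexed by frame, the filters can differ from one frame to the next, which is the "time-variant" aspect referred to above.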