Neural beamformers, which integrate pre-separation and beamforming modules, have shown impressive efficacy in the target speech extraction task. Nevertheless, the performance of these beamformers is inherently constrained by the predictive accuracy of the pre-separation module. In this paper, we introduce a neural beamformer underpinned by a dual-path transformer. First, we harness the cross-attention mechanism in the time domain to extract pivotal beamforming-related spatial information from the noisy covariance matrix. Then, in the frequency domain, we employ the self-attention mechanism to bolster the model's capacity to process frequency-specific details. By design, our model circumvents the influence of a pre-separation module, delivering its performance in a more holistic, end-to-end fashion. Experimental results reveal that our model not only surpasses contemporary leading neural beamforming algorithms in separation performance, but also achieves this with a notably smaller parameter count.
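To make the dual-path design concrete, the following is a minimal PyTorch-style sketch of one block under stated assumptions: the names (`DualPathBlock`, `cov_feat`), the tensor layout `(batch, time, freq, dim)`, and the idea that the noisy covariance matrix has already been projected into a per-time-frequency feature sequence are all illustrative choices, not the paper's exact design. It only shows the attention pattern described in the abstract: cross-attention along time against covariance-derived spatial features, then self-attention along frequency.

```python
# Hypothetical sketch of one dual-path transformer block (assumed design).
import torch
import torch.nn as nn


class DualPathBlock(nn.Module):
    """Time-path cross-attention against covariance-derived spatial features,
    followed by frequency-path self-attention."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, cov_feat: torch.Tensor) -> torch.Tensor:
        # x, cov_feat: (batch, time, freq, dim); cov_feat is assumed to be a
        # learned projection of the noisy covariance matrix per T-F bin.
        b, t, f, d = x.shape

        # Time path: for each frequency bin, attend across time frames with
        # queries from the mixture features and keys/values from the
        # covariance-derived spatial features (cross-attention).
        q = x.permute(0, 2, 1, 3).reshape(b * f, t, d)
        kv = cov_feat.permute(0, 2, 1, 3).reshape(b * f, t, d)
        t_out, _ = self.time_attn(q, kv, kv)
        x = self.norm_t(q + t_out).reshape(b, f, t, d).permute(0, 2, 1, 3)

        # Frequency path: for each time frame, attend across frequency bins
        # (self-attention) to capture frequency-specific detail.
        s = x.reshape(b * t, f, d)
        f_out, _ = self.freq_attn(s, s, s)
        return self.norm_f(s + f_out).reshape(b, t, f, d)


if __name__ == "__main__":
    block = DualPathBlock(dim=64)
    mix = torch.randn(2, 100, 65, 64)   # (batch, frames, freq bins, dim)
    cov = torch.randn(2, 100, 65, 64)   # projected covariance features
    print(block(mix, cov).shape)        # torch.Size([2, 100, 65, 64])
```

Stacking several such blocks and regressing beamforming weights from the output would complete an end-to-end pipeline of this kind; those surrounding components are omitted here since the abstract does not specify them.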