Deep neural networks (DNNs) have greatly benefited direction-of-arrival (DoA) estimation methods for speech source localization in noisy environments. However, their localization accuracy is still far from satisfactory because they remain vulnerable to nonspeech interference. To improve robustness against interference, we propose a DNN-based normalized time-frequency (T-F) weighted criterion that minimizes the distance between the candidate steering vectors and the filtered snapshots in the T-F domain. Our method requires no eigendecomposition and uses a simple normalization to prevent the optimization objective from being misled by noisy filtered snapshots. We also study different designs of T-F weights guided by a DNN. We find that, within the proposed approach, duplicating the Hadamard product of speech ratio masks is highly effective and outperforms alternatives such as direct masking and taking the mean. In general, however, the best-performing design of T-F weights is criterion-dependent. Experiments show that the proposed method outperforms popular DNN-based DoA estimation methods, including widely used subspace methods, in noisy and reverberant environments.
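To make the idea of a normalized, T-F weighted steering-vector matching criterion concrete, the following is a minimal NumPy sketch. It is not the paper's exact formulation: the uniform linear array, the far-field steering model, the microphone spacing, the unit-norm and reference-phase normalization, and the function names `steering_vectors` and `estimate_doa` are all assumptions made for illustration. The T-F weights are taken as a given non-negative array standing in for DNN-derived mask weights.

```python
import numpy as np

def steering_vectors(angles_deg, freqs_hz, n_mics, spacing_m=0.04, c=343.0):
    """Far-field steering vectors for an assumed uniform linear array.

    Returns an array of shape (A, F, M) with unit-norm vectors.
    """
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
    freqs = np.asarray(freqs_hz, dtype=float)
    # Inter-microphone delays (seconds) for endfire incidence (sin(theta) = 1).
    delays = np.arange(n_mics) * spacing_m / c
    phase = (-2j * np.pi
             * np.sin(theta)[:, None, None]
             * freqs[None, :, None]
             * delays[None, None, :])
    return np.exp(phase) / np.sqrt(n_mics)

def estimate_doa(snapshots, weights, angles_deg, freqs_hz, eps=1e-12):
    """Grid search over candidate angles with a weighted distance criterion.

    snapshots: complex array (F, T, M) of (filtered) STFT snapshots.
    weights:   non-negative array (F, T), e.g. derived from DNN masks
               (assumed to be given here).
    """
    n_mics = snapshots.shape[-1]
    # Illustrative normalization: remove the phase of microphone 0 and
    # scale every snapshot to unit norm so loud noisy bins cannot dominate.
    ref_phase = np.exp(-1j * np.angle(snapshots[..., :1]))
    norms = np.linalg.norm(snapshots, axis=-1, keepdims=True)
    y = snapshots * ref_phase / (norms + eps)

    a = steering_vectors(angles_deg, freqs_hz, n_mics)      # (A, F, M)
    diff = a[:, :, None, :] - y[None, :, :, :]              # (A, F, T, M)
    dist = np.sum(np.abs(diff) ** 2, axis=-1)               # (A, F, T)

    # Weighted average distance per candidate angle.
    cost = np.sum(weights[None] * dist, axis=(1, 2)) / (np.sum(weights) + eps)
    best = int(np.argmin(cost))
    return angles_deg[best], cost
```

In this sketch no eigendecomposition is needed: the estimate is simply the candidate angle whose steering vectors lie closest, on weighted average, to the normalized snapshots. Setting a weight to zero removes the corresponding T-F bin from the objective.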