Leveraging Low-Distortion Target Estimates for Improved Speech Enhancement

A promising approach for multi-microphone speech separation involves two deep neural networks (DNNs), where the predicted target speech from the first DNN is used to compute signal statistics for time-invariant minimum variance distortionless response (MVDR) beamforming, and the MVDR result is then used as extra features for the second DNN to predict target speech. Previous studies suggested that the MVDR result can provide complementary information that helps the second DNN better predict target speech. However, on fixed-geometry arrays, both DNNs can take in, for example, the real and imaginary (RI) components of the multi-channel mixture as features to leverage spatial and spectral information for enhancement. It has not been clearly explained why the linear MVDR result is complementary and why it is still needed, given that the DNNs and the beamformer use the same input and that the DNNs perform non-linear filtering, which could render the linear filtering of MVDR unnecessary. Similarly, in monaural cases, one can replace the MVDR beamformer with a monaural weighted prediction error (WPE) filter. Although the linear WPE filter and the DNNs use the same mixture RI components as input, the WPE result is found to significantly improve the second DNN. This study provides a novel explanation based on the low-distortion nature of such algorithms, and finds that they can consistently improve phase estimation. Equipped with this understanding, we investigate several low-distortion target estimation algorithms, including several beamformers, WPE, forward convolutive prediction, and their combinations, and use their results as extra features to train the second network to achieve better enhancement. Evaluation results on single- and multi-microphone speech dereverberation and enhancement tasks indicate the effectiveness of the proposed approach and the validity of the proposed view.
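
The middle stage of the pipeline described above can be sketched in a few lines. The snippet below is only an illustration under stated assumptions, not the authors' implementation: it handles a single frequency bin of a two-microphone array, uses the common covariance-based MVDR formulation (weights taken from a column of Phi_n^{-1} Phi_s, normalised by its trace), and substitutes an oracle target for the first DNN's output. All function names (`outer_mean`, `mvdr_weights`, `second_dnn_features`, and so on) are hypothetical.

```python
import cmath
import random


def outer_mean(frames):
    """Average the outer product x x^H over frames of the form [mic0, mic1]."""
    cov = [[0j, 0j], [0j, 0j]]
    for x in frames:
        for i in range(2):
            for j in range(2):
                cov[i][j] += x[i] * x[j].conjugate() / len(frames)
    return cov


def inv2(m):
    """Closed-form inverse of a 2x2 complex matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mvdr_weights(mixture, target_est, ref=0):
    """Time-invariant MVDR weights for one frequency bin.

    target_est stands in for the first DNN's target estimate (assumption);
    the noise estimate is taken as mixture minus target estimate.
    """
    noise_est = [[y[c] - s[c] for c in range(2)]
                 for y, s in zip(mixture, target_est)]
    phi_s = outer_mean(target_est)   # target spatial covariance
    phi_n = outer_mean(noise_est)    # noise spatial covariance
    num = matmul2(inv2(phi_n), phi_s)
    trace = num[0][0] + num[1][1]
    return [num[c][ref] / trace for c in range(2)]


def apply_filter(w, frames):
    """Beamformer output w^H y for every frame."""
    return [sum(w[c].conjugate() * y[c] for c in range(2)) for y in frames]


def second_dnn_features(mixture, beamformed):
    """Stack RI parts of the mixture and of the MVDR output, frame by frame."""
    return [[y[0].real, y[0].imag, y[1].real, y[1].imag, b.real, b.imag]
            for y, b in zip(mixture, beamformed)]


def cgauss():
    """One sample of circular complex Gaussian noise."""
    return complex(random.gauss(0, 1), random.gauss(0, 1))


# Toy data: a point source seen through a relative transfer function d,
# with d[0] = 1 at the reference microphone, plus uncorrelated noise.
random.seed(0)
T = 200
d = [1 + 0j, cmath.exp(-0.7j)]
target = [[d[0] * s, d[1] * s] for s in (cgauss() for _ in range(T))]
noise = [[cgauss(), cgauss()] for _ in range(T)]
mixture = [[t[c] + n[c] for c in range(2)] for t, n in zip(target, noise)]

w = mvdr_weights(mixture, target)        # oracle estimate, for illustration
beamformed = apply_filter(w, mixture)
features = second_dnn_features(mixture, beamformed)
```

In this toy setting the filter's response to the target direction, `w[0].conjugate() * d[0] + w[1].conjugate() * d[1]`, equals one up to rounding, so the target passes through the linear filter undistorted while the noise is attenuated. That is the low-distortion property the abstract refers to. In a real system the target estimate would come from the first DNN and the computation would be repeated for every frequency bin.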
