Recurrent neural networks using the LSTM architecture can achieve significant single-channel noise reduction. It is not obvious, however, how to apply them to multi-channel inputs in a way that generalizes to new microphone configurations. In contrast, spatial clustering techniques can achieve such generalization but lack a strong signal model. This paper combines the two approaches to attain both the spatial separation performance and generality of multichannel spatial clustering and the signal modeling performance of multiple parallel single-channel LSTM speech enhancers. To focus on generalization, the system is compared to several baselines on the CHiME3 dataset in terms of speech quality predicted by the PESQ algorithm and the word error rate of a recognizer trained on mismatched conditions. Our experiments show that combining the LSTM models with spatial clustering reduces word error rate by 4.6\% absolute (17.2\% relative) on the development set and 11.2\% absolute (25.5\% relative) on the test set compared with the spatial clustering system alone, and by 10.75\% absolute (32.72\% relative) on the development set and 6.12\% absolute (15.76\% relative) on the test set compared with the LSTM model alone.
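The absolute and relative word error rate reductions quoted above are linked by the baseline error rate: the relative reduction is the absolute reduction divided by the baseline. The following minimal sketch back-computes the baseline each pair of figures would imply. The function names are hypothetical, and the implied baselines are derived from the rounded numbers in the abstract; they are not values reported by the authors.

```python
def implied_baseline_wer(absolute_drop, relative_drop_pct):
    """Baseline WER (in %) implied by an absolute drop (in points)
    and the matching relative drop (in %)."""
    return absolute_drop / (relative_drop_pct / 100.0)


def relative_drop_pct(baseline_wer, new_wer):
    """Relative WER reduction (in %) going from baseline_wer to new_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# (absolute drop, relative drop) pairs as quoted in the abstract.
reported = {
    ("spatial clustering", "dev"): (4.6, 17.2),
    ("spatial clustering", "test"): (11.2, 25.5),
    ("LSTM", "dev"): (10.75, 32.72),
    ("LSTM", "test"): (6.12, 15.76),
}

for (baseline_name, split), (abs_drop, rel_drop) in reported.items():
    # Implied values only: derived from rounded figures, not reported results.
    base = implied_baseline_wer(abs_drop, rel_drop)
    combined = base - abs_drop
    print(f"{baseline_name:>18} / {split:<4}: implied baseline {base:.2f}%, "
          f"implied combined system {combined:.2f}%")
```

If the figures are mutually consistent, the implied error rate of the combined system should come out roughly the same on a given split whichever baseline it is derived from, which gives a quick sanity check on the quoted numbers.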