Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations, but have difficulty incorporating a detailed speech/noise model. In contrast, LSTM neural networks have been trained successfully to distinguish speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings. This paper integrates the two approaches by training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method. The combined system attains both the spatial separation performance and generality of multi-channel spatial clustering and the signal modeling performance of multiple parallel single-channel LSTM speech enhancers. In experiments on the CHiME-3 dataset of noisy tablet recordings, the system increases speech quality, as measured by the Perceptual Evaluation of Speech Quality (PESQ) algorithm, and reduces the word error rate of the baseline CHiME-3 speech recognizer, both relative to the default BeamformIt beamformer.
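To make the idea of combining a spatial mask with per-channel speech masks concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the abstract does not specify how the masks are combined, so the rule used here (average the per-channel masks, then multiply by the spatial mask) is an assumption. The function names `combine_masks` and `apply_mask` are hypothetical, and random arrays stand in for the MESSL and LSTM outputs.

```python
import numpy as np

def combine_masks(spatial_mask, channel_masks):
    """Combine one spatial mask with per-channel speech masks.

    spatial_mask:  (freq, time) array in [0, 1], a stand-in for a
                   spatial-clustering mask.
    channel_masks: (channels, freq, time) array in [0, 1], a stand-in
                   for the outputs of parallel single-channel enhancers.

    ASSUMPTION: the combination rule (mean over channels, then an
    elementwise product) is illustrative only.
    """
    spatial_mask = np.clip(spatial_mask, 0.0, 1.0)
    channel_masks = np.clip(channel_masks, 0.0, 1.0)
    return spatial_mask * channel_masks.mean(axis=0)

def apply_mask(stft, mask):
    """Scale each complex time-frequency bin of a reference STFT by the mask."""
    return mask * stft

# Toy data: 6 channels (as in CHiME-3), 5 frequency bins, 4 frames.
rng = np.random.default_rng(0)
n_ch, n_freq, n_time = 6, 5, 4
stft = rng.standard_normal((n_freq, n_time)) + 1j * rng.standard_normal((n_freq, n_time))
spatial = rng.uniform(size=(n_freq, n_time))
per_channel = rng.uniform(size=(n_ch, n_freq, n_time))

mask = combine_masks(spatial, per_channel)
enhanced = apply_mask(stft, mask)
```

Because every mask value lies in [0, 1], the masked spectrogram never exceeds the input in magnitude at any time-frequency bin; bins that either mask judges to be noise are attenuated.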