We propose a method for improving the quality of synthetic room impulse responses (RIRs) generated by acoustic simulators for far-field speech recognition tasks. We bridge the gap between synthetic and real RIRs using a novel one-dimensional CycleGAN architecture. A synthetic RIR is passed to the one-dimensional CycleGAN as raw-waveform audio and translated into a real RIR. We then apply sub-band room equalization to the translated RIR to further improve its quality. We artificially create far-field speech by convolving clean speech from the LibriSpeech dataset [1] with RIRs and adding background noise. We show that far-field speech simulated with RIRs improved by our approach reduces the word error rate by up to 19.9% compared with unmodified synthetic RIRs on the Kaldi LibriSpeech far-field automatic speech recognition benchmark [2].
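The far-field simulation step described above (convolution with an RIR followed by additive background noise) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `simulate_far_field`, the truncation to the clean-speech length, and the mixing at a fixed signal-to-noise ratio `snr_db` are all assumptions introduced here.

```python
import numpy as np

def simulate_far_field(clean, rir, noise, snr_db):
    """Reverberate clean speech with an RIR and add noise at a target SNR.

    Hypothetical helper: the SNR-based mixing rule is an assumption,
    not a detail taken from the abstract.
    """
    # Linear convolution with the RIR, truncated to the clean-speech length.
    reverberant = np.convolve(clean, rir)[: len(clean)]

    # Tile or trim the noise so it matches the reverberant signal length.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]

    # Scale the noise so that speech power / noise power equals the target SNR.
    speech_power = np.mean(reverberant ** 2)
    noise_power = max(np.mean(noise ** 2), 1e-12)  # guard against silent noise
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    return reverberant + gain * noise

# Toy usage with synthetic signals standing in for real recordings.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                    # 1 s of "speech" at 16 kHz
rir = np.exp(-np.arange(4000) / 800.0) * rng.standard_normal(4000)
rir[0] = 1.0                                          # direct-path component
noise = rng.standard_normal(8000)
far_field = simulate_far_field(clean, rir, noise, snr_db=10.0)
```

In practice, the `rir` argument would hold either an unmodified synthetic RIR or one improved by the proposed method, which allows the two conditions to be compared under otherwise identical augmentation.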