We present a method for improving the quality of synthetic room impulse responses (RIRs) for far-field speech recognition. We bridge the gap between the fidelity of synthetic RIRs and that of real RIRs using our novel TS-RIRGAN architecture. Given a synthetic RIR in the form of raw audio, we use TS-RIRGAN to translate it into a real RIR. We also perform real-world sub-band room equalization on the translated synthetic RIR. Our overall approach improves the quality of synthetic RIRs by compensating for low-frequency wave effects, similar to those in real RIRs. We evaluate the improved synthetic RIRs on a far-field speech dataset created by convolving the LibriSpeech clean speech dataset [1] with RIRs and adding background noise. We show that far-field speech augmented using our improved synthetic RIRs reduces the word error rate by up to 19.9% on the Kaldi far-field automatic speech recognition benchmark [2].
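The augmentation step described above (convolving clean speech with an RIR and adding background noise) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `augment_far_field`, the `snr_db` parameter, the truncation to the clean-speech length, and the random stand-in signals are all assumptions made for illustration, and the signals are assumed to be non-silent.

```python
import numpy as np


def augment_far_field(clean, rir, noise, snr_db):
    """Reverberate clean speech with an RIR, then add noise at a target SNR.

    Hypothetical helper for illustration; SNR-based mixing is an assumption.
    """
    # Reverberant speech is the convolution of the clean signal with the RIR;
    # the tail is cropped so the output has the same length as the input.
    reverberant = np.convolve(clean, rir)[: len(clean)]

    # Repeat or crop the noise so it covers the whole utterance.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]

    # Scale the noise so that 10*log10(speech_power / noise_power) == snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise


# Stand-in signals (assumed 16 kHz): random "speech", a decaying-noise "RIR",
# and white background noise shorter than the utterance.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
rir = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800.0)
noise = rng.standard_normal(8000)

far = augment_far_field(clean, rir, noise, snr_db=10.0)
```

In practice `clean` would be a LibriSpeech utterance and `rir` a synthetic, improved, or real RIR, so that recognizers trained on each augmented set can be compared.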