We propose a novel approach to blind room impulse response (RIR) estimation in the context of a downstream application, far-field automatic speech recognition (ASR). We first establish the connection between improved RIR estimation and improved ASR performance, and use it as a means of evaluating neural RIR estimators. We then propose a generative adversarial network (GAN) based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features. The architecture is trained with a novel energy decay relief loss that optimizes for capturing energy-based properties of the input reverberant speech. We show that our model outperforms state-of-the-art baselines on acoustic benchmarks (by 17\% on the energy decay relief and 22\% on an early-reflection energy metric), as well as in an ASR evaluation task (by 6.9\% in word error rate).
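To make the energy decay relief (EDR) concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the common definition of the EDR as the backward (Schroeder-style) integration of short-time spectral energy in each frequency bin, expressed in dB, and it assumes an illustrative loss given by the mean absolute difference between the EDRs of an estimated and a reference RIR of equal length. The function names (`stft_power`, `energy_decay_relief`, `edr_loss`), the frame length, and the hop size are hypothetical choices; the abstract does not specify the exact form of the loss.

```python
import numpy as np


def stft_power(x, frame_len=256, hop=128):
    """Power spectrogram |STFT|^2 with a Hann window; shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def energy_decay_relief(x, frame_len=256, hop=128, eps=1e-12):
    """EDR in dB: energy remaining from each frame onward, per frequency bin."""
    power = stft_power(x, frame_len, hop)
    # Reverse in time, accumulate, reverse back: backward integration.
    remaining = np.cumsum(power[::-1], axis=0)[::-1]
    return 10.0 * np.log10(remaining + eps)


def edr_loss(rir_est, rir_ref, frame_len=256, hop=128):
    """Assumed loss form: mean absolute EDR difference in dB (equal-length RIRs)."""
    edr_est = energy_decay_relief(rir_est, frame_len, hop)
    edr_ref = energy_decay_relief(rir_ref, frame_len, hop)
    return float(np.mean(np.abs(edr_est - edr_ref)))
```

By construction, the EDR of any signal is non-increasing over time in every frequency bin, and under this assumed form the loss is zero when the two RIRs are identical. In a training setup the same computation would need to be written in a differentiable framework; NumPy is used here only for illustration.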