We propose to characterize and improve the performance of blind room impulse response (RIR) estimation systems in the context of a downstream application, far-field automatic speech recognition (ASR). We first establish the connection between improved RIR estimation and improved ASR performance, so that ASR accuracy can serve as a means of evaluating neural RIR estimators. We then propose a GAN-based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features. The model is trained with a novel energy decay relief loss that optimizes for capturing energy-based properties of the input reverberant speech. We show that our model outperforms the state-of-the-art baselines on acoustic benchmarks (by 72% on the energy decay relief and 22% on an early-reflection energy metric), as well as in an ASR evaluation task (by 6.9% in word error rate).
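To make the energy decay relief (EDR) concrete, the following is a minimal sketch rather than the loss used in this work, whose exact formulation the abstract does not specify. It assumes the standard definition of the EDR as the backward-integrated short-time spectral energy of an RIR, and it assumes an L1 distance in dB between two EDRs. The function names `energy_decay_relief` and `edr_loss`, the STFT parameters, and the synthetic RIR are hypothetical choices made for illustration.

```python
import numpy as np


def energy_decay_relief(rir, n_fft=256, hop=128):
    """Return the EDR in dB with shape (n_frames, n_fft // 2 + 1).

    Assumed definition: EDR(t, f) = sum over tau >= t of |H(tau, f)|^2,
    where H is a short-time Fourier transform of the RIR.
    """
    rir = np.asarray(rir, dtype=float)
    # Zero-pad so that the last frame covers the tail of the RIR.
    n_frames = 1 + int(np.ceil(max(len(rir) - n_fft, 0) / hop))
    padded = np.zeros((n_frames - 1) * hop + n_fft)
    padded[: len(rir)] = rir

    # Hann-windowed frames, one row per time step.
    window = np.hanning(n_fft)
    frames = np.stack(
        [padded[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Backward integration along time: energy remaining from frame t onward.
    edr = np.cumsum(power[::-1], axis=0)[::-1]
    return 10.0 * np.log10(edr + 1e-12)


def edr_loss(rir_est, rir_ref, n_fft=256, hop=128):
    """Mean absolute EDR difference in dB (an assumed distance, for illustration).

    Both RIRs must have the same length so that the two EDRs align.
    """
    est = energy_decay_relief(rir_est, n_fft, hop)
    ref = energy_decay_relief(rir_ref, n_fft, hop)
    return float(np.mean(np.abs(est - ref)))


if __name__ == "__main__":
    # Synthetic RIR: exponentially decaying noise (a common toy model).
    rng = np.random.default_rng(0)
    t = np.arange(4000)
    reference = rng.standard_normal(4000) * np.exp(-t / 800.0)
    too_dry = reference * np.exp(-t / 400.0)  # decays faster than the reference
    print("loss(reference, reference):", edr_loss(reference, reference))
    print("loss(too_dry, reference):", edr_loss(too_dry, reference))
```

Because each EDR value sums all remaining energy in a frequency band, the surface is non-increasing in time, and an estimate whose energy decays too quickly or too slowly differs from the reference across the whole tail rather than at isolated samples.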