We propose a novel method for generating scene-aware training data for far-field automatic speech recognition. We use a deep learning-based estimator to non-intrusively compute the sub-band reverberation time of an environment from its speech samples. We model the acoustic characteristics of a scene with its reverberation time and represent it using a multivariate Gaussian distribution. We use this distribution to select acoustic impulse responses from a large real-world dataset for augmenting speech data. On the REVERB and AMI far-field benchmarks, the speech recognition system trained on our scene-aware data consistently outperforms a system trained with many more randomly selected acoustic impulse responses. In practice, we obtain a 2.64% absolute improvement in word error rate compared with training data of the same size generated with uniformly distributed reverberation times.
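To make the selection step concrete, the following is a minimal sketch (not the authors' code) of the idea described above: fit a multivariate Gaussian to sub-band reverberation-time (T60) vectors estimated from a scene's speech, then keep the acoustic impulse responses (AIRs) from a candidate corpus whose sub-band T60 vectors are most likely under that Gaussian. The helper names `fit_scene_gaussian` and `select_airs`, the number of sub-bands, and the synthetic numbers are illustrative assumptions, and the deep T60 estimator itself is assumed to run beforehand.

```python
# Hedged sketch: Gaussian modeling of a scene's sub-band T60s and likelihood-based
# selection of matching acoustic impulse responses for data augmentation.
import numpy as np
from scipy.stats import multivariate_normal


def fit_scene_gaussian(scene_t60s):
    """Fit a multivariate Gaussian to sub-band T60 vectors estimated
    non-intrusively from the scene's speech samples.
    scene_t60s: (num_samples, num_bands) array."""
    mean = scene_t60s.mean(axis=0)
    cov = np.cov(scene_t60s, rowvar=False)
    # Small diagonal regularization keeps the covariance well-conditioned.
    cov += 1e-6 * np.eye(cov.shape[0])
    return multivariate_normal(mean=mean, cov=cov)


def select_airs(scene_dist, air_t60s, num_airs):
    """Rank candidate AIRs by the likelihood of their sub-band T60 vectors
    under the scene distribution and keep the top num_airs for augmentation.
    air_t60s: (num_candidates, num_bands) array."""
    log_liks = scene_dist.logpdf(air_t60s)
    return np.argsort(log_liks)[::-1][:num_airs]


# Example usage with placeholder numbers (not real measurements):
rng = np.random.default_rng(0)
scene_t60s = 0.6 + 0.05 * rng.standard_normal((200, 6))  # estimated from scene speech
air_t60s = 0.2 + 0.8 * rng.random((10000, 6))             # measured from an AIR corpus
scene_dist = fit_scene_gaussian(scene_t60s)
chosen = select_airs(scene_dist, air_t60s, num_airs=500)   # indices of scene-matched AIRs
```

The selected AIRs would then be convolved with clean speech to produce scene-matched far-field training data, which is the augmentation step the abstract refers to.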