We propose a technique for computing spectrograms using Frequency Domain Linear Prediction (FDLP), which uses all-pole models to fit the Hilbert envelope of speech in different frequency sub-bands. The spectrogram of a complete speech utterance is computed by overlap-add of contiguous all-pole model responses. The long context window of 1.5 seconds allows us to capture the low-frequency temporal modulations of speech in the spectrogram. On an end-to-end automatic speech recognition task, the FDLP-spectrogram performs on par with standard mel-spectrogram features when both training and test data are clean read speech. In more realistic conditions, namely mismatched train-test situations and noisy, reverberated training data, the FDLP-spectrogram shows word error rate (WER) improvements of up to 25% and 22%, respectively, over the mel-spectrogram.
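To make the core idea concrete, the following is a minimal sketch of FDLP envelope estimation for a single sub-band, not the authors' implementation. It relies on the standard FDLP duality: linear prediction applied to the DCT of a signal yields an all-pole model whose power response approximates the temporal (Hilbert) envelope. The function name `fdlp_envelope`, the Hanning sub-band window, the fractional `band` argument, and the default model order are all assumptions made for illustration; the overlap-add across contiguous 1.5-second windows is omitted.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz


def fdlp_envelope(x, order=20, band=(0.0, 1.0), n_points=None):
    """Approximate the temporal envelope of one sub-band with an all-pole model.

    `band` is a hypothetical convenience: (low, high) edges given as fractions
    of the DCT index range, so (0.0, 1.0) selects the full band.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    n_points = n if n_points is None else n_points

    # 1. Move to the frequency domain with an orthonormal DCT-II.
    coeffs = dct(x, type=2, norm="ortho")

    # 2. Select and window the sub-band (window shape is an assumption).
    lo, hi = int(band[0] * n), int(band[1] * n)
    sub = coeffs[lo:hi] * np.hanning(hi - lo)

    # 3. Autocorrelation of the DCT sequence for lags 0..order.
    r = np.array([np.dot(sub[: len(sub) - k], sub[k:]) for k in range(order + 1)])
    r[0] *= 1.0 + 1e-9  # tiny ridge for numerical stability

    # 4. Solve the Yule-Walker equations for the predictor coefficients.
    a_tail = solve_toeplitz(r[:order], -r[1 : order + 1])
    a = np.concatenate(([1.0], a_tail))
    gain = r[0] + np.dot(a_tail, r[1 : order + 1])  # prediction error power

    # 5. Evaluate the all-pole power response; by duality, the angle axis
    #    0..pi maps onto the time axis 0..n of the analysis window.
    _, h = freqz([1.0], a, worN=n_points)
    return gain * np.abs(h) ** 2
```

As a usage example, applying `fdlp_envelope` to a noise burst centred at sample 1200 of a 2000-sample window should give an envelope whose peak lies close to that sample. A spectrogram in the spirit of the abstract would repeat this per sub-band and per window, then overlap-add the resulting envelopes.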