Acoustic modeling of raw waveforms and learning feature extractors as part of the neural network classifier have been the goal of many studies in the area of automatic speech recognition (ASR). Recently, one line of research has focused on front-end frameworks that can be pre-trained on audio-only data in an unsupervised fashion, with the aim of improving downstream ASR tasks. In this work, we investigate the usefulness of one of these frameworks, namely wav2vec, for hybrid ASR systems. In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) that was trained on the same task but with different input features. For comparison, we also apply another neural front-end that is trained only with the supervised ASR loss, as well as traditional Gammatone features. Moreover, we show that the AM can be retrofitted with i-vectors for speaker adaptation. Finally, the described features are combined to further improve performance. With the final best system, we obtain relative improvements of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets, respectively.
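To make the front-end setup concrete, below is a minimal sketch of the pipeline the abstract describes: frame-level features are extracted from a pre-trained, unsupervised wav2vec-style model and fed to a hybrid acoustic model. It uses torchaudio's wav2vec 2.0 pipeline as a readily available stand-in (the paper uses the original wav2vec), and the file name, hidden size, and number of tied HMM states are illustrative assumptions, not values from the paper.

    import torch
    import torchaudio

    # Pre-trained, unsupervised front-end (frozen here; it could also be fine-tuned
    # together with the AM). WAV2VEC2_BASE is a stand-in for the original wav2vec.
    bundle = torchaudio.pipelines.WAV2VEC2_BASE
    frontend = bundle.get_model().eval()

    waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical file
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

    with torch.inference_mode():
        # extract_features returns the per-layer transformer outputs; a hybrid
        # system would use one of these layers as the AM's input features.
        features, _ = frontend.extract_features(waveform)
    feats = features[-1]  # shape: (batch, frames, feature_dim)

    # Hypothetical hybrid AM: a frame-wise classifier over tied triphone states,
    # whose posteriors are passed to an HMM/WFST decoder.
    num_states = 12001  # assumption: number of tied CD-HMM states
    am = torch.nn.Sequential(
        torch.nn.Linear(feats.shape[-1], 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, num_states),
    )
    state_log_posteriors = am(feats).log_softmax(dim=-1)

In the same spirit, the Gammatone or supervised-only neural front-end mentioned above would simply replace the feature-extraction step, while the AM on top stays structurally unchanged.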