We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder of a listen, attend and spell (LAS) model by multitask training it on both paired audio-text data and text-only data. Trained on the 100-hour subset of LibriSpeech, the proposed method, without requiring an additional language model, yields an 11% relative performance improvement over the baseline and approaches the performance of language-model shallow fusion on the test-clean evaluation set. We observe a similar trend on the full 960-hour LibriSpeech training set. Analyses of different error types and sample output sentences demonstrate that the proposed method incorporates language-level information, suggesting its effectiveness in real-world applications.
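One plausible form of such a multitask objective (a sketch based on our reading of the abstract; the mixing weight $\lambda$ and the decoder-as-language-model formulation are assumptions, not details stated here) combines the standard attention-based sequence loss on paired data with a language-model loss in which the decoder predicts text-only sentences without attending to audio:

\[
\mathcal{L} = \mathcal{L}_{\mathrm{ASR}}(x, y) + \lambda \, \mathcal{L}_{\mathrm{LM}}(y_{\mathrm{text}}),
\]

where $(x, y)$ is a paired audio-transcript example, $y_{\mathrm{text}}$ is a sentence from the text-only corpus, and $\lambda$ controls the strength of the text-only regularization of the decoder.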