We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data. Previously, the joint acoustic and text decoder (JATD) has shown promising results through the use of text data during model training, and the recently introduced deliberation architecture has reduced recognition errors by leveraging first-pass decoding results. Our method, dubbed Deliberation-JATD, combines the spelling-correction ability of deliberation with JATD's use of unpaired text data to further improve performance. The proposed model produces substantial gains across multiple test sets, especially those focused on rare words, where it reduces word error rate (WER) by 12% to 22.5% relative. This is achieved without increasing model size or requiring multi-stage training, making Deliberation-JATD an efficient candidate for on-device applications.
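To make the high-level idea concrete, below is a minimal conceptual sketch, not the paper's implementation, of a second-pass deliberation decoder step that attends over both acoustic encoder features and an embedded first-pass hypothesis, plus a JATD-style text-only mode in which the acoustic context is dropped so unpaired text can contribute to training. All class and variable names here are hypothetical illustrations.

```python
# Hypothetical sketch of a deliberation-style second-pass decoder with a
# JATD-style text-only training mode; dimensions and fusion are illustrative only.
import torch
import torch.nn as nn

class ToyDeliberationJATDDecoder(nn.Module):
    def __init__(self, vocab_size=100, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)                    # embeds first-pass hypothesis tokens
        self.acoustic_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.hyp_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, query, encoder_feats, first_pass_tokens, text_only=False):
        hyp = self.embed(first_pass_tokens)                           # (B, U, dim)
        ctx_hyp, _ = self.hyp_attn(query, hyp, hyp)                   # attend over first-pass hypothesis
        if text_only:
            # Text-only batch: no audio available, so the acoustic context is skipped
            return self.out(ctx_hyp)
        ctx_ac, _ = self.acoustic_attn(query, encoder_feats, encoder_feats)
        return self.out(ctx_hyp + ctx_ac)                             # fuse hypothesis and acoustic contexts

# Smoke test with random inputs: batch=2, 10 decoder steps, 50 encoder frames, 8 hypothesis tokens
dec = ToyDeliberationJATDDecoder()
q = torch.randn(2, 10, 64)
enc = torch.randn(2, 50, 64)
hyp_tokens = torch.randint(0, 100, (2, 8))
paired_logits = dec(q, enc, hyp_tokens)                  # paired audio-text mode
text_logits = dec(q, None, hyp_tokens, text_only=True)   # unpaired text mode
print(paired_logits.shape, text_logits.shape)
```

The sketch only illustrates the two ingredients named in the abstract: a second pass that conditions on first-pass hypotheses, and a training path that functions without acoustic input so unpaired text can be used, which is what allows gains without adding model size or a separate training stage.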