Text-only training and semi-supervised training based on audio-only data have gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose incorporating text-only and semi-supervised training into an attention-based deliberation model. By using text-only data to train a Bidirectional Encoder Representations from Transformers (BERT) model for the deliberation text encoder, and by leveraging large-scale text-to-speech and audio-only utterances through a joint acoustic and text decoder (JATD) and semi-supervised training, we achieve 4%-12% relative WER reduction across various tasks compared to the baseline deliberation model. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the Google Voice Search WER by 11% relative. We show that the deliberation model also achieves a positive human side-by-side evaluation against the state-of-the-art LM rescorer while maintaining reasonable endpointer latencies.