We present JOIST, an algorithm for training a streaming, cascaded-encoder end-to-end (E2E) model with both speech-text paired inputs and text-only unpaired inputs. Unlike previous works, we explore joint training with both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST with a streaming E2E model trained on an order of magnitude more data, which is also a novelty compared to previous works. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate text sub-word unit representation. We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text. In addition, we quantitatively show that JOIST maintains streaming capabilities, which is important for a good user experience.
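To make the joint-training idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the paper's actual architecture or loss): a shared decoder consumes either audio-encoder outputs from paired data or embedded, length-upsampled text units from unpaired data, and the two losses are summed in a single training step. All module names, dimensions, and the 0.25 mixing weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the shared decoder and the two input branches:
# an audio encoder for paired speech-text data, and a text embedding
# that maps (upsampled) text units into the same representation space.
class ToyJointModel(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=1024, hidden=256):
        super().__init__()
        self.audio_encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.Linear(hidden, vocab_size)  # stand-in for the shared decoder

    def forward_paired(self, feats, targets):
        enc, _ = self.audio_encoder(feats)
        logits = self.decoder(enc.mean(dim=1))
        return nn.functional.cross_entropy(logits, targets)

    def forward_text_only(self, text_units, targets, repeat=2):
        # Crude duration modeling: repeat each text unit to mimic frame-level
        # length, one of the design choices the paper ablates.
        upsampled = self.text_embed(text_units).repeat_interleave(repeat, dim=1)
        logits = self.decoder(upsampled.mean(dim=1))
        return nn.functional.cross_entropy(logits, targets)

model = ToyJointModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One joint training step: paired loss plus a weighted text-only loss.
feats = torch.randn(4, 50, 80)                 # paired speech features
paired_targets = torch.randint(0, 1024, (4,))
text_units = torch.randint(0, 1024, (4, 10))   # unpaired text (sub-word ids)
text_targets = torch.randint(0, 1024, (4,))

loss = model.forward_paired(feats, paired_targets) \
     + 0.25 * model.forward_text_only(text_units, text_targets)
opt.zero_grad()
loss.backward()
opt.step()
```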