In end-to-end (E2E) speech recognition models, a tight representational coupling inevitably emerges between the encoder and the decoder. We build upon recent work that has begun to explore building encoders with modular encoded representations, such that encoders and decoders from different models can be stitched together in a zero-shot manner without further fine-tuning. While previous research addresses only full-context speech models, we also explore the problem in a streaming setting. Our framework builds on top of existing encoded representations, converting them into modular features, dubbed Lego-Features, without modifying the pre-trained model. The features remain interchangeable when the model is retrained with distinct initializations. Though sparse, Lego-Features prove powerful when tested with RNN-T or LAS decoders, maintaining high-quality downstream performance. They are also rich enough to represent the first-pass prediction during two-pass deliberation. In this scenario, they outperform the N-best hypotheses, since they do not need to be supplemented with acoustic features to deliver the best results. Moreover, generating Lego-Features requires neither beam search nor auto-regressive computation. Overall, they present a modular, powerful, and cheap alternative to both the standard encoder output and the N-best hypotheses.