We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design in which the neural segmenter is integrated with the causal 1st pass decoder to emit an end-of-segment (EOS) signal in real-time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways of finalizing the 2nd pass, and find that a novel dummy frame injection strategy achieves high-quality 2nd pass results and low finalization latency simultaneously. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.