Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost and also to make high-quality end-of-query (EOQ) predictions based on ongoing ASR computation. On a voice search test set, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (a 30.8% reduction) and 90th-percentile latency by 170 ms (a 23.0% reduction), without regressing word error rate (WER). For continuous recognition, WER improves by 10.6% (relative).
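To make the "switch" connection concrete, below is a minimal PyTorch sketch of how such an endpointer might be wired. The abstract does not specify layer types or sizes, so the dimensions, the LSTM trunk, the three-class output, and all names here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of the "switch" connection described above.
# All dimensions, layer choices, and names are illustrative assumptions,
# not the authors' actual architecture.
import torch
import torch.nn as nn

class SwitchedEndpointer(nn.Module):
    """Endpointer that consumes either raw audio frames or low-level
    latents from the shared ASR encoder, selected by a switch."""

    def __init__(self, feat_dim=80, latent_dim=512, hidden_dim=64, num_classes=3):
        super().__init__()
        # Separate input projections let a single EP accept either source.
        self.from_audio = nn.Linear(feat_dim, hidden_dim)
        self.from_latent = nn.Linear(latent_dim, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Frame-level posteriors, e.g. speech / silence / end-of-query.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feats=None, asr_latents=None):
        # The "switch": route either input through a shared EP trunk.
        if asr_latents is not None:
            x = self.from_latent(asr_latents)  # EOQ mode: reuses ASR compute
        else:
            x = self.from_audio(audio_feats)   # cheap frame-filtering mode
        h, _ = self.rnn(x)
        return self.classifier(h)              # [batch, time, num_classes]
```

During multitask training, one plausible recipe samples the switch position per batch so both input paths are learned, optimizing a joint objective such as L = L_ASR + λ·L_EP with a frame-level cross-entropy for the EP; the sampling scheme and λ weighting are assumptions here, not details given in the abstract.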