In voice-enabled applications, a predetermined hotword is usually used to activate a device in order to attend to the query. However, speaking queries followed by a hotword each time introduces a cognitive burden in continued conversations. To avoid repeating a hotword, we propose a streaming end-to-end (E2E) intended query detector that identifies utterances directed towards the device and filters out utterances not directed towards the device. The proposed approach incorporates the intended query detector into the E2E model that already folds different components of the speech recognition pipeline into one neural network. Joint E2E modeling of speech decoding and intended query detection also allows us to declare a quick intended query decision based on early partial recognition results, which is important to decrease latency and make the system responsive. We demonstrate that the proposed E2E approach yields a 22% relative improvement in equal error rate (EER) for detection accuracy and a 600 ms latency improvement compared with an independent intended query detector. In our experiments, the proposed model detects whether the user is talking to the device with an 8.7% EER within 1.4 seconds of median latency after the user starts speaking.
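Detection accuracy is reported as equal error rate (EER), the operating point where the false-accept rate (non-device-directed utterances accepted) equals the false-reject rate (device-directed utterances rejected). As a minimal illustrative sketch, not the paper's evaluation code, EER can be estimated by sweeping a decision threshold over the detector's scores:

```python
import numpy as np

def equal_error_rate(pos_scores, neg_scores):
    """Estimate EER from detector scores for device-directed (pos)
    and non-device-directed (neg) utterances.

    Sweeps candidate thresholds over all observed scores and returns
    the mean of FAR and FRR at the threshold where they are closest.
    """
    pos_scores = np.asarray(pos_scores, dtype=float)
    neg_scores = np.asarray(neg_scores, dtype=float)
    thresholds = np.sort(np.concatenate([pos_scores, neg_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        far = np.mean(neg_scores >= t)  # non-directed utterances accepted
        frr = np.mean(pos_scores < t)   # directed utterances rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With four scores per class where one positive and one negative overlap, the closest crossing gives an EER of 0.25; on a well-separated detector the value approaches zero.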