While streaming voice assistant systems have been used in many applications, these systems typically focus on unnatural, one-shot interactions that assume the input is a single voice query without hesitation or disfluency. However, a common conversational utterance often involves multiple queries with turn-taking, in addition to disfluencies. These disfluencies include pausing to think, hesitations, word lengthening, filled pauses, and repeated phrases. This makes speech recognition on conversational speech, including utterances with multiple queries, a challenging task. To better model conversational interaction, it is critical to discriminate between disfluencies and the end of a query, so that the user can hold the floor during disfluencies while the system responds as quickly as possible once the user has finished speaking. In this paper, we present a turn-taking predictor built on top of an end-to-end (E2E) speech recognizer. Our best system is obtained by jointly optimizing the ASR task with detecting whether the user is pausing to think or has finished speaking. The proposed approach achieves over 97% recall and 85% precision in predicting true turn-taking with only 100 ms latency, on a test set designed with 4 types of disfluencies inserted into conversational utterances.
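To make the joint-optimization idea concrete, the following is a minimal sketch (not the authors' implementation) of a frame-level turn-taking head trained together with an E2E ASR objective on a shared streaming encoder. The encoder architecture, turn-taking label set, vocabulary size, and loss weighting are illustrative assumptions, not details taken from the paper.

```python
# Sketch: joint ASR + turn-taking training. All architectural details below are
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed frame-level turn-taking labels: still speaking / pausing to think / end of query.
TURN_LABELS = ["speaking", "pause_to_think", "end_of_query"]

class JointASRTurnTaking(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=1000):
        super().__init__()
        # Unidirectional (streaming-friendly) encoder shared by both tasks.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.asr_head = nn.Linear(hidden, vocab_size + 1)      # +1 for CTC blank
        self.turn_head = nn.Linear(hidden, len(TURN_LABELS))   # per-frame turn-taking logits

    def forward(self, feats):
        enc, _ = self.encoder(feats)                           # (B, T, hidden)
        return self.asr_head(enc), self.turn_head(enc)

def joint_loss(asr_logits, turn_logits, targets, feat_lens, target_lens,
               turn_targets, lam=0.5):
    # ASR branch: CTC loss over the shared encoder output.
    log_probs = asr_logits.log_softmax(-1).transpose(0, 1)     # (T, B, V+1)
    ctc = F.ctc_loss(log_probs, targets, feat_lens, target_lens,
                     blank=asr_logits.size(-1) - 1)
    # Turn-taking branch: frame-level cross-entropy on the assumed label set.
    ce = F.cross_entropy(turn_logits.reshape(-1, turn_logits.size(-1)),
                         turn_targets.reshape(-1))
    return ctc + lam * ce                                      # lam is an assumed weighting
```

At inference time, such a model would emit turn-taking logits every frame alongside the ASR hypothesis, so a decision to respond (or to keep listening through a pause) can be made with low latency; the 100 ms figure in the abstract refers to the authors' system, not to this sketch.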