Interactive voice assistants are widely used as input interfaces in various scenarios, e.g., on smart home devices, wearables, and AR devices. Detecting the end of a speech query, i.e., speech end-pointing, is an important task for voice assistants interacting with users. Traditionally, speech end-pointing is based on pure classification methods with arbitrary binary targets. In this paper, we propose a novel regression-based speech end-pointing model, which enables an end-pointer to adjust its detection behavior based on the context of user queries. Specifically, we present a pause modeling method and show its effectiveness for dynamic end-pointing. In our experiments with vendor-collected smartphone and wearable speech queries, our strategy shows a better trade-off between end-pointing latency and accuracy than the traditional classification-based method. We further discuss the benefits of this model and the generalization of the framework in the paper.
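
To make the contrast between binary classification targets and a regression-style pause target concrete, the following is a minimal sketch in Python. It is not the model proposed in the paper: the abstract does not define the pause target, so the choice here (the number of consecutive silent frames elapsed so far) is an assumption, and the names `binary_targets`, `pause_targets`, and `first_endpoint` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Optional


def binary_targets(num_frames: int, end_frame: int) -> List[int]:
    """Classification-style targets: 0 before the query ends, 1 from end_frame on."""
    return [0 if t < end_frame else 1 for t in range(num_frames)]


def pause_targets(is_speech: List[bool]) -> List[int]:
    """Assumed regression-style targets: silent frames elapsed since the last speech frame."""
    targets = []
    run = 0
    for speech in is_speech:
        # Reset the running pause length on speech, otherwise extend it by one frame.
        run = 0 if speech else run + 1
        targets.append(run)
    return targets


def first_endpoint(pause_estimates: List[float], threshold: float) -> Optional[int]:
    """Return the first frame whose pause estimate reaches the threshold, or None."""
    for t, pause in enumerate(pause_estimates):
        if pause >= threshold:
            return t
    return None


# Toy frame-level speech activity: a short mid-query pause, then a longer final pause.
frames = [True, True, False, False, True, False, False, False]
pauses = pause_targets(frames)

# A low threshold fires on the mid-query pause; a higher one waits for the final pause.
eager = first_endpoint(pauses, threshold=2)
patient = first_endpoint(pauses, threshold=3)
```

In this toy setup a continuous pause estimate lets the decision threshold vary with context, whereas a fixed binary target bakes a single decision rule into the labels. A trained regressor would supply the pause estimates in place of the ground-truth values used here.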