When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device. In many cases, however, the VA can be invoked unintentionally, by keyword-like speech or an accidental button press, which has implications for user experience and privacy. To this end, we propose an acoustic false-trigger-mitigation (FTM) approach for on-device device-directed speech detection that simultaneously handles voice-trigger and touch-based invocation. To facilitate on-device deployment, we introduce a new streaming decision layer, derived using the notion of temporal convolutional networks (TCN) [1], known for their computational efficiency. To the best of our knowledge, this is the first approach that can detect device-directed speech from more than one invocation type in a streaming fashion. We compare this approach with streaming alternatives based on a vanilla average layer and on canonical LSTMs, and show: (i) that all models suffer only a small degradation in accuracy compared with the invocation-specific models, and (ii) that the newly introduced streaming TCN consistently performs better than or comparably to the alternatives, while mitigating device-undirected speech earlier in time and with a larger relative reduction in runtime peak memory than the LSTM-based approach (33% vs. 7%, each measured against its non-streaming counterpart).
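To make the streaming decision layer concrete, below is a minimal sketch of a TCN-style per-frame classifier, assuming PyTorch; the class names (CausalConv1d, StreamingTCNDecisionLayer), feature dimension, depth, and two-class head are illustrative assumptions, not the paper's exact configuration. The left-only padding is what makes the stack causal, so each frame's directed/undirected posterior depends only on current and past audio and can be emitted as frames stream in.

```python
# A minimal sketch (assuming PyTorch) of a causal, streaming TCN decision
# layer. All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution padded on the left only, so each output frame
    depends solely on current and past inputs (streaming-safe)."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-padding length
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))        # pad the past, never the future
        return self.conv(x)

class StreamingTCNDecisionLayer(nn.Module):
    """Stack of dilated causal conv blocks followed by a per-frame
    classifier, emitting a directed/undirected posterior at every frame."""
    def __init__(self, feat_dim=256, hidden=128, levels=4, kernel_size=3):
        super().__init__()
        blocks, in_ch = [], feat_dim
        for i in range(levels):            # dilations 1, 2, 4, 8, ...
            blocks += [CausalConv1d(in_ch, hidden, kernel_size, dilation=2 ** i),
                       nn.ReLU()]
            in_ch = hidden
        self.tcn = nn.Sequential(*blocks)
        self.head = nn.Conv1d(hidden, 2, kernel_size=1)  # 2 classes per frame

    def forward(self, feats):              # feats: (batch, time, feat_dim)
        x = feats.transpose(1, 2)          # -> (batch, feat_dim, time)
        logits = self.head(self.tcn(x))    # (batch, 2, time)
        return logits.transpose(1, 2)      # per-frame logits: (batch, time, 2)

# Usage: score a 100-frame utterance encoding and read the frame-wise
# posteriors; an early, confident "undirected" decision would let the
# device stop processing before the utterance ends.
model = StreamingTCNDecisionLayer()
posteriors = model(torch.randn(1, 100, 256)).softmax(dim=-1)
print(posteriors.shape)                    # torch.Size([1, 100, 2])
```

Unlike an LSTM, this layer carries no recurrent state: its receptive field is bounded by the kernel size and dilation schedule, which is one plausible reason a TCN-based decision layer can hold a smaller runtime peak memory than an LSTM-based one.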