Keyword spotting (KWS) plays a critical role in enabling speech-based user interactions on smart devices. Recent developments in deep learning have led to wide adoption of convolutional neural networks (CNNs) in KWS systems because of their accuracy and robustness. The main challenge faced by KWS systems is the trade-off between high accuracy and low latency. Unfortunately, there has been little quantitative analysis of the actual latency of KWS models on mobile devices. This is especially concerning since conventional convolution-based KWS approaches are known to require a large number of operations to attain an adequate level of performance. In this paper, we propose a temporal convolution for real-time KWS on mobile devices. Unlike most 2D convolution-based KWS approaches, which require a deep architecture to fully capture both low- and high-frequency domains, we exploit temporal convolutions with a compact ResNet architecture. On the Google Speech Commands Dataset, we achieve more than a 385x speedup on Google Pixel 1 and surpass the accuracy of the state-of-the-art model. In addition, we release the implementation of the proposed and baseline models, including an end-to-end pipeline for training the models and evaluating them on mobile devices.
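To make the operation-count argument concrete, the following is a minimal sketch that compares multiply-accumulate (MAC) counts for a single intermediate layer under two layouts. It assumes that a temporal convolution convolves along the time axis only, so the feature map has width 1, while a 2D convolution slides over both time and frequency. The input size (98 frames by 40 frequency bins), the 16 channels, the kernel sizes, and the helper `conv_macs` are illustrative assumptions, not values or code taken from the paper.

```python
def conv_macs(out_h, out_w, k_h, k_w, c_in, c_out):
    """Multiply-accumulates for one stride-1, same-padded convolution layer."""
    return out_h * out_w * k_h * k_w * c_in * c_out


# Assumed, illustrative sizes (not from the paper):
T = 98   # time frames
F = 40   # frequency bins
C = 16   # input and output channels of an intermediate layer

# 2D convolution: the feature map keeps both axes (T x F), 3x3 kernel.
macs_2d = conv_macs(T, F, 3, 3, C, C)

# Temporal convolution: the feature map is T x 1, 9x1 kernel along time.
macs_temporal = conv_macs(T, 1, 9, 1, C, C)

print(f"2D conv MACs:       {macs_2d}")
print(f"temporal conv MACs: {macs_temporal}")
print(f"ratio:              {macs_2d / macs_temporal:.1f}")
```

Under these assumptions the temporal layer needs fewer operations by a factor equal to the number of frequency bins. MAC counts are only a proxy, though: as the abstract stresses, actual latency has to be measured on the target device.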