Keyword spotting is an important research field because it plays a key role in device wake-up and user interaction on smart devices. However, it is challenging to minimize errors while operating efficiently on devices with limited resources such as mobile phones. We present a broadcasted residual learning method to achieve high accuracy with small model size and computational load. Our method configures most of the residual functions as 1D temporal convolutions while still allowing 2D convolution through a broadcasted residual connection that expands the temporal output to the frequency-temporal dimension. This residual mapping enables the network to represent useful audio features effectively with much less computation than conventional convolutional neural networks. We also propose a novel network architecture, the Broadcasting-residual network (BC-ResNet), based on broadcasted residual learning, and describe how to scale the model according to the target device's resources. BC-ResNets achieve state-of-the-art 98.0% and 98.7% top-1 accuracy on Google Speech Commands datasets v1 and v2, respectively, and consistently outperform previous approaches while using fewer computations and parameters.
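As a rough illustration of the broadcasted residual idea described above, the following minimal PyTorch sketch computes cheap 1D temporal features and then broadcasts them back across the frequency axis by adding them to the 2D feature map. The channel and kernel sizes, the plain BatchNorm layers, and the block layout are illustrative assumptions, not the authors' exact BC-ResNet block, which additionally uses components omitted here for brevity.

```python
# Minimal sketch of a broadcasted residual block (assumptions noted in the lead-in).
import torch
import torch.nn as nn

class BroadcastedResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 2D path: lightweight frequency-wise depthwise convolution
        self.freq_dw = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0),
                      groups=channels, bias=False),
            nn.BatchNorm2d(channels),
        )
        # 1D path: depthwise temporal convolution followed by a pointwise convolution
        self.temporal = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1,
                      groups=channels, bias=False),
            nn.BatchNorm1d(channels),
            nn.Conv1d(channels, channels, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frequency, time)
        f2 = self.freq_dw(x)                # 2D frequency-temporal features
        f1 = self.temporal(f2.mean(dim=2))  # average over frequency -> 1D temporal features
        # Broadcasted residual connection: expand the 1D temporal output back to the
        # frequency-temporal dimension and add the identity shortcut.
        return x + f2 + f1.unsqueeze(2)


if __name__ == "__main__":
    block = BroadcastedResidualBlock(channels=16)
    spec = torch.randn(4, 16, 40, 101)      # e.g. 40 mel bins, 101 frames
    print(block(spec).shape)                # torch.Size([4, 16, 40, 101])
```

The key point is that the expensive work happens on 1D temporal features, and a simple `unsqueeze` plus tensor broadcasting restores the frequency-temporal shape at negligible cost.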