Smart audio devices are gated by an always-on, lightweight keyword spotting program to reduce power consumption. However, it is challenging to design models that combine high accuracy with low latency, so that the device responds both correctly and quickly. Many efforts have been made to develop end-to-end neural networks that adopt depthwise separable convolutions, temporal convolutions, and LSTMs as building blocks. Nonetheless, these networks, designed by human experts, may not achieve an optimal trade-off within an expansive search space. In this paper, we propose to leverage recent advances in differentiable neural architecture search to discover more efficient networks. Our searched model attains 97.2% top-1 accuracy on Google Speech Command Dataset v1 with only about 100K parameters.
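To illustrate the general idea behind differentiable neural architecture search, the following is a minimal sketch of a softmax-weighted "mixed operation", the continuous relaxation commonly used in this family of methods. It is not the search space or procedure of this paper, which the abstract does not specify: the candidate operations (`identity`, `avg_pool3`, `zero`), the function names, and the 1-D list representation are all hypothetical and chosen only to keep the example self-contained.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate operations on a 1-D feature sequence.
def identity(x):
    return list(x)

def avg_pool3(x):
    # 3-tap moving average with edge-replication padding.
    padded = [x[0]] + list(x) + [x[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(x))]

def zero(x):
    return [0.0] * len(x)

CANDIDATES = {"identity": identity, "avg_pool3": avg_pool3, "zero": zero}

def mixed_op(x, alphas):
    """Blend all candidate outputs, weighted by softmax(alphas).

    During search, `alphas` would be learned by gradient descent
    alongside the network weights (assumed, not shown here).
    """
    weights = softmax(alphas)
    outputs = [op(x) for op in CANDIDATES.values()]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(x))
    ]

def derive_op(alphas):
    """After search, keep only the candidate with the largest alpha."""
    names = list(CANDIDATES)
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return names[best]
```

With equal `alphas`, `mixed_op` averages the three candidates; as one entry of `alphas` grows, the blend approaches that single operation, which is what makes the discrete architecture choice amenable to gradient-based optimization.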