Keyword spotting aims to identify specific keyword utterances in audio. In recent years, deep convolutional neural networks have been widely used in keyword spotting systems. However, their architectures are mainly based on off-the-shelf backbones such as VGG-Net or ResNet, rather than being designed specifically for the task. In this paper, we use neural architecture search to design convolutional neural network models that boost keyword spotting performance while maintaining an acceptable memory footprint. Specifically, we search for model operators and their connections in a dedicated search space with Encoder-Decoder neural architecture optimization. Extensive evaluations on Google's Speech Commands Dataset show that the model architecture found by our approach achieves a state-of-the-art accuracy of over 97%.