This paper introduces neural architecture search (NAS) for the automatic discovery of end-to-end keyword spotting (KWS) models in limited resource environments. We employ a differentiable NAS approach to optimize the structure of convolutional neural networks (CNNs) operating on raw audio waveforms. After a suitable KWS model is found with NAS, we quantize weights and activations to reduce the memory footprint. We conduct extensive experiments on the Google speech commands dataset. In particular, we compare our end-to-end approach to mel-frequency cepstral coefficient (MFCC) based systems. For quantization, we compare fixed bit-width quantization and trained bit-width quantization. Using NAS only, we obtain a highly efficient model with an accuracy of 95.55% using 75.7k parameters and 13.6M operations. Using trained bit-width quantization, the same model achieves a test accuracy of 93.76% while using on average only 2.91 bits per activation and 2.51 bits per weight.
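As a rough illustration of the fixed bit-width case (a minimal sketch with hypothetical values, not the paper's actual quantization scheme), a symmetric uniform quantizer maps each weight onto one of at most 2^b representable levels for a fixed bit width b:

```python
def quantize(values, bits):
    """Symmetric uniform quantization to a fixed bit width.

    Maps each value onto a grid of at most 2**bits levels centered at zero.
    This is an illustrative sketch; trained bit-width quantization would
    instead learn `bits` per layer during training.
    """
    qmax = 2 ** (bits - 1) - 1            # largest positive integer level
    scale = max(abs(v) for v in values) / qmax  # step size from the value range
    return [round(v / scale) * scale for v in values]

# Hypothetical weight vector quantized to 3 bits:
weights = [0.5, -0.23, 0.1, -0.9]
q = quantize(weights, bits=3)            # each entry snaps to a multiple of 0.3
```

Reducing `bits` shrinks the per-weight storage cost but coarsens the grid, which is the accuracy/memory trade-off the abstract quantifies (95.55% at full precision vs. 93.76% at roughly 2.5-3 bits).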