Spoken keyword spotting (KWS) is the task of identifying a keyword in an audio stream. It is widely used in smart edge devices to activate voice assistants and perform hands-free tasks. The task is challenging because a system must achieve high accuracy while still running efficiently on low-power devices with possibly limited computational capabilities. This work presents AraSpot, an Arabic keyword spotting system trained on 40 Arabic keywords using several online data augmentation techniques and a newly introduced ConformerGRU model architecture. We further improve the model's performance by training a text-to-speech model to generate synthetic data. AraSpot achieves a state-of-the-art (SOTA) result of 99.59%, outperforming previous approaches.
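As an illustration of what online data augmentation can look like in a KWS training loop, the following is a minimal sketch, not AraSpot's actual pipeline. The abstract does not name the specific augmentations used, so the two transforms here (a random time shift and additive Gaussian noise at a random signal-to-noise ratio) are assumptions, as are the function name `augment_waveform`, its parameters, and the 16 kHz sample rate in the usage lines.

```python
import numpy as np


def augment_waveform(wave, rng, max_shift=1600, snr_db_range=(10.0, 30.0)):
    """Return a randomly perturbed copy of a 1-D waveform (hypothetical helper).

    Assumed transforms, chosen only for illustration:
      1. zero-padded time shift by up to `max_shift` samples in either direction
      2. additive Gaussian noise at an SNR drawn uniformly from `snr_db_range`
    """
    wave = np.asarray(wave, dtype=np.float32)

    # 1. Random time shift; samples pushed past either end are dropped
    #    and the vacated region is filled with zeros.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(wave)
    if shift > 0:
        shifted[shift:] = wave[:-shift]
    elif shift < 0:
        shifted[:shift] = wave[-shift:]
    else:
        shifted[:] = wave

    # 2. Additive noise scaled so that signal power / noise power matches
    #    the sampled SNR (in decibels).
    snr_db = rng.uniform(*snr_db_range)
    signal_power = float(np.mean(shifted ** 2))
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=shifted.shape)

    return (shifted + noise).astype(np.float32)


# Usage: "online" means a fresh random variant is drawn each time a sample
# is loaded, so the model rarely sees the exact same waveform twice.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0           # one second at an assumed 16 kHz
clean = np.sin(2 * np.pi * 440.0 * t)    # stand-in for a recorded keyword
variant_a = augment_waveform(clean, rng)
variant_b = augment_waveform(clean, rng)
```

Because the perturbation is applied at load time rather than precomputed, no augmented copies need to be stored on disk, and each epoch sees different variants of the same recordings.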