In this paper, we investigated a speech-augmentation-based unsupervised learning approach for the keyword spotting (KWS) task. KWS is a useful speech application, yet it depends heavily on labeled data. We designed a CNN-Attention architecture for the KWS task: the CNN layers capture local acoustic features, while the attention layers model long-range dependencies. To improve the robustness of the KWS model, we also proposed an unsupervised learning method. The unsupervised loss is based on the similarity between the original and augmented speech features, as well as on audio reconstruction information. Two speech augmentation methods are explored in the unsupervised learning: speed and intensity perturbation. Experiments on the Google Speech Commands V2 Dataset demonstrated that our CNN-Attention model achieves competitive results. Moreover, the augmentation-based unsupervised learning further improves the classification accuracy of the KWS task. In our experiments, with augmentation-based unsupervised learning, our KWS model achieves better performance than other unsupervised methods, such as CPC, APC, and MPC.
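To make the described objective concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it pairs speed and intensity perturbation with a loss combining embedding similarity and feature reconstruction. The names TinyEncoder and unsupervised_loss, all layer sizes, and the loss weighting are hypothetical placeholders introduced only for illustration.

```python
# A minimal, illustrative sketch of an augmentation-based unsupervised objective.
# Layer sizes, module names, and the loss weighting are assumptions for
# demonstration, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def speed_perturb(wave: torch.Tensor, factor: float) -> torch.Tensor:
    # Resample a mono waveform of shape (T,) to simulate a speed change.
    new_len = max(1, int(wave.numel() / factor))
    return F.interpolate(wave.view(1, 1, -1), size=new_len,
                         mode="linear", align_corners=False).view(-1)


def intensity_perturb(wave: torch.Tensor, gain: float) -> torch.Tensor:
    # Scale the waveform amplitude to simulate an intensity change.
    return wave * gain


class TinyEncoder(nn.Module):
    # Stand-in for a CNN-Attention encoder: a Conv1d front-end over log-mel
    # frames followed by one self-attention layer and mean pooling.
    def __init__(self, n_mels: int = 40, dim: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, n_mels) -> utterance embedding (batch, dim)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.attn(x, x, x)
        return self.proj(x.mean(dim=1))


def unsupervised_loss(encoder, decoder, feats, feats_aug, alpha=1.0):
    # Similarity term between embeddings of the original and augmented features,
    # plus a simple reconstruction term (here: predicting the mean log-mel frame).
    z, z_aug = encoder(feats), encoder(feats_aug)
    sim_loss = 1.0 - F.cosine_similarity(z, z_aug, dim=-1).mean()
    recon_loss = F.mse_loss(decoder(z), feats.mean(dim=1))
    return sim_loss + alpha * recon_loss


if __name__ == "__main__":
    # Toy usage with random tensors standing in for log-mel features; in practice
    # feats_aug would be extracted from the speed/intensity-perturbed waveform.
    enc, dec = TinyEncoder(), nn.Linear(64, 40)
    feats = torch.randn(8, 100, 40)
    feats_aug = torch.randn(8, 100, 40)
    loss = unsupervised_loss(enc, dec, feats, feats_aug)
    loss.backward()
    print(float(loss))
```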