In recent years, the development of accurate deep keyword spotting (KWS) models has led to KWS technology being embedded in a range of products such as voice assistants. Many of these models rely on large amounts of labelled data to achieve good performance. As a result, their use is restricted to applications for which a large labelled speech data set can be obtained. Self-supervised learning seeks to mitigate the need for large labelled data sets by leveraging unlabelled data, which is easier to obtain in large amounts. However, most self-supervised methods have only been investigated for very large models, whereas KWS models are desired to be small. In this paper, we investigate the use of self-supervised pretraining for smaller KWS models in a label-deficient scenario. We pretrain the Keyword Transformer model using the self-supervised framework Data2Vec and carry out experiments on a label-deficient setup of the Google Speech Commands data set. We find that the pretrained models greatly outperform models without pretraining, showing that Data2Vec pretraining can increase the performance of KWS models in label-deficient scenarios. The source code is made publicly available.
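To make the pretraining recipe concrete, below is a minimal PyTorch sketch of Data2Vec-style self-supervised pretraining for a small transformer KWS encoder. The module names, dimensions, and the SmallAudioTransformer stand-in for the Keyword Transformer are illustrative assumptions rather than the authors' released implementation; only the core mechanism follows the Data2Vec framework described above: an exponential-moving-average (EMA) teacher produces targets from the averaged top-K layer outputs of the unmasked input, and the student regresses those targets at masked positions.

# A minimal sketch of Data2Vec-style pretraining for a small transformer
# KWS encoder. Hypothetical names and dimensions; not the authors' code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallAudioTransformer(nn.Module):
    """Tiny transformer encoder over MFCC frames (stand-in for the Keyword Transformer)."""
    def __init__(self, n_mels=40, dim=64, depth=4, heads=4):
        super().__init__()
        self.embed = nn.Linear(n_mels, dim)  # project each frame to the model dimension
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
             for _ in range(depth)]
        )

    def forward(self, x):
        h = self.embed(x)
        hidden = []
        for layer in self.layers:
            h = layer(h)
            hidden.append(h)
        return hidden  # per-layer hidden states, needed for the teacher targets


def ema_update(teacher, student, tau=0.999):
    """Exponential-moving-average update of the teacher weights."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps, alpha=1 - tau)


def data2vec_step(student, teacher, mask_emb, batch, top_k=3, mask_ratio=0.5):
    """One pretraining step: student sees masked input, regresses teacher targets."""
    b, t, _ = batch.shape
    mask = torch.rand(b, t, device=batch.device) < mask_ratio  # True = masked frame

    # Teacher: full (unmasked) input; targets = mean of its top-K layer outputs,
    # normalized as in Data2Vec.
    with torch.no_grad():
        t_hidden = teacher(batch)
        targets = torch.stack(t_hidden[-top_k:]).mean(0)
        targets = F.layer_norm(targets, (targets.size(-1),))

    # Student: masked frames are replaced by a learned mask embedding in input space.
    masked = torch.where(mask.unsqueeze(-1), mask_emb.expand_as(batch), batch)
    pred = student(masked)[-1]

    # Regression loss only on the masked positions.
    return F.smooth_l1_loss(pred[mask], targets[mask])


if __name__ == "__main__":
    student = SmallAudioTransformer()
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    mask_emb = nn.Parameter(torch.zeros(40))  # learned mask token (input space)
    opt = torch.optim.AdamW(list(student.parameters()) + [mask_emb], lr=1e-4)

    fake_mfcc = torch.randn(8, 98, 40)  # unlabelled batch: 8 clips, 98 frames, 40 mel bins
    loss = data2vec_step(student, teacher, mask_emb, fake_mfcc)
    loss.backward()
    opt.step()
    ema_update(teacher, student)
    print(f"pretraining loss: {loss.item():.4f}")

After pretraining on unlabelled speech in this fashion, the student encoder would be fine-tuned with a classification head on the small labelled subset of Google Speech Commands; the fine-tuning stage itself is standard supervised training and is omitted from the sketch.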