In this paper, we propose a novel end-to-end user-defined keyword spotting method that utilizes linguistically corresponding patterns between speech and text sequences. Unlike previous approaches requiring speech keyword enrollment, our method compares input queries with an enrolled text keyword sequence. To place the audio and text representations within a common latent space, we adopt an attention-based cross-modal matching approach that is trained in an end-to-end manner with monotonic matching loss and keyword classification loss. We also utilize a de-noising loss for the acoustic embedding network to improve robustness in noisy environments. Additionally, we introduce the LibriPhrase dataset, a new short-phrase dataset based on LibriSpeech for efficiently training keyword spotting models. Our proposed method achieves competitive results on various evaluation sets compared to other single-modal and cross-modal baselines.
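To make the abstract's core mechanism concrete, below is a minimal PyTorch sketch of attention-based cross-modal matching between acoustic and text embeddings, trained with a matching-style loss plus a keyword classification loss. All module names, dimensions, the diagonal-penalty surrogate for the monotonic matching loss, and the loss weight are illustrative assumptions, not the paper's exact architecture; the de-noising loss on the acoustic embedding network is omitted for brevity.

```python
# Minimal sketch of cross-modal keyword matching (assumptions, not the
# paper's exact implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalMatcher(nn.Module):
    """Attends enrolled text (keyword) embeddings over audio embeddings
    and classifies whether the spoken query matches the text keyword."""

    def __init__(self, dim: int = 128):
        super().__init__()
        # Text tokens as queries, audio frames as keys/values, so both
        # modalities interact in a common latent space.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, 1)  # match / non-match logit

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor):
        # audio_emb: (B, T_a, D) frames; text_emb: (B, T_t, D) tokens.
        # attn_weights (B, T_t, T_a) is the soft audio-text alignment.
        aligned, attn_weights = self.attn(text_emb, audio_emb, audio_emb)
        logit = self.classifier(aligned.mean(dim=1)).squeeze(-1)
        return logit, attn_weights


def monotonic_matching_loss(attn: torch.Tensor) -> torch.Tensor:
    # Illustrative surrogate: penalize attention mass far from the
    # diagonal, encouraging a monotonic text-to-audio alignment.
    B, T_t, T_a = attn.shape
    t = torch.linspace(0, 1, T_t, device=attn.device).view(1, T_t, 1)
    a = torch.linspace(0, 1, T_a, device=attn.device).view(1, 1, T_a)
    return (attn * (t - a).abs()).sum(dim=(1, 2)).mean()


# Toy usage: one positive and one negative audio-text pair.
model = CrossModalMatcher()
audio = torch.randn(2, 50, 128)   # stand-in acoustic embeddings
text = torch.randn(2, 8, 128)     # stand-in keyword text embeddings
label = torch.tensor([1.0, 0.0])  # match / non-match targets

logit, attn = model(audio, text)
loss = F.binary_cross_entropy_with_logits(logit, label) \
       + 0.5 * monotonic_matching_loss(attn)  # 0.5 weight is an assumption
loss.backward()
```

A reasonable design point this sketch reflects: because the keyword is enrolled as text rather than speech, the alignment must be learned jointly with the classifier, which is why the matching loss and the classification loss are optimized together end-to-end.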