Traditional Query-by-Example (QbE) speech search approaches usually rely on frame-level features, while state-of-the-art approaches tend to use models based on acoustic word embeddings (AWEs) to transform variable-length audio signals into fixed-length feature vectors. However, these approaches cannot satisfy the requirements on search quality and speed at the same time. In this paper, we propose a novel fast QbE speech search method based on separable models to address this problem. First, a QbE speech search training framework is introduced. Second, we design a novel model inference scheme based on RepVGG that efficiently improves QbE search quality. Third, we modify and improve our QbE speech search model according to the proposed inference scheme. Experiments on a keywords dataset show that our method improves the GPU Real-Time Factor (RTF) from 1/150 to 1/2300 simply by applying the separable model scheme, and outperforms other state-of-the-art methods.
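The RepVGG-style inference scheme the abstract refers to rests on structural re-parameterization: a training-time block with parallel 3×3, 1×1, and identity branches is algebraically collapsed into a single 3×3 convolution at inference time, which is what makes the deployed model fast. The following is a minimal NumPy sketch of that fusion (bias and BatchNorm terms omitted; the naive `conv2d` helper and all shapes are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def conv2d(x, w):
    """Naive same-padding convolution (cross-correlation), stride 1.
    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wid = x.shape[1:]
    y = np.zeros((c_out, h, wid))
    for co in range(c_out):
        for ci in range(c_in):
            for i in range(k):
                for j in range(k):
                    y[co] += w[co, ci, i, j] * xp[ci, i:i + h, j:j + wid]
    return y

rng = np.random.default_rng(0)
C = 4                                   # identity branch needs C_in == C_out
x = rng.standard_normal((C, 8, 8))
w3 = rng.standard_normal((C, C, 3, 3))  # 3x3 branch
w1 = rng.standard_normal((C, C, 1, 1))  # 1x1 branch

# Training-time multi-branch output: 3x3 conv + 1x1 conv + identity.
y_branches = conv2d(x, w3) + conv2d(x, w1) + x

# Fusion: embed the 1x1 kernel at the center of a 3x3 kernel, and
# express the identity branch as a 3x3 kernel with a centered 1 per channel.
w1_pad = np.zeros_like(w3)
w1_pad[:, :, 1, 1] = w1[:, :, 0, 0]
w_id = np.zeros_like(w3)
for c in range(C):
    w_id[c, c, 1, 1] = 1.0
w_fused = w3 + w1_pad + w_id

# A single 3x3 convolution now reproduces the three-branch output exactly.
y_fused = conv2d(x, w_fused)
assert np.allclose(y_branches, y_fused)
```

Because the fused model is a plain stack of 3×3 convolutions, inference needs one kernel launch per layer instead of three branches plus an elementwise sum, which is consistent with the RTF improvement the abstract reports.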