Self-supervised learning methods that provide generalized speech representations have recently received increasing attention. Wav2vec 2.0 is the best-known example, showing remarkable performance on numerous downstream speech processing tasks. Despite its success, its high computational cost makes it challenging to use directly for wake-up word detection on mobile devices. In this work, we propose LiteFEW, a lightweight feature encoder for wake-up word detection that preserves the inherent ability of wav2vec 2.0 at a minimal scale. In the proposed method, the knowledge of the pre-trained wav2vec 2.0 is compressed with an auto-encoder-based dimensionality reduction technique and then distilled into LiteFEW. Experimental results on the open-source "Hey Snips" dataset show that the proposed method, applied to various model structures, significantly improves performance, achieving a relative improvement of over 20% with only 64k parameters.
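
The compress-then-distill idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the configuration used in this work: the auto-encoder is a single pair of linear maps without bias, both losses are mean squared errors, and the dimensions (TEACHER_DIM, BOTTLENECK_DIM) and function names are hypothetical. Random vectors stand in for wav2vec 2.0 and student features, and no training loop is shown.

```python
import random


def linear(weights, x):
    """Apply a bias-free dense layer; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def init_matrix(rows, cols, rng):
    """Small random weights, only so that the sketch runs end to end."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


TEACHER_DIM = 16    # hypothetical stand-in for the teacher feature size
BOTTLENECK_DIM = 4  # hypothetical compressed size the student must match

rng = random.Random(0)
enc = init_matrix(BOTTLENECK_DIM, TEACHER_DIM, rng)  # auto-encoder encoder
dec = init_matrix(TEACHER_DIM, BOTTLENECK_DIM, rng)  # auto-encoder decoder


def reconstruction_loss(teacher_feat):
    """Objective for the auto-encoder: rebuild the teacher feature."""
    code = linear(enc, teacher_feat)
    return mse(linear(dec, code), teacher_feat)


def distillation_loss(student_feat, teacher_feat):
    """Objective for the student: match the compressed teacher feature."""
    target = linear(enc, teacher_feat)
    return mse(student_feat, target)


# Random placeholders for one frame of teacher and student features.
teacher_feat = [rng.gauss(0, 1) for _ in range(TEACHER_DIM)]
student_feat = [rng.gauss(0, 1) for _ in range(BOTTLENECK_DIM)]

print("reconstruction loss:", reconstruction_loss(teacher_feat))
print("distillation loss:", distillation_loss(student_feat, teacher_feat))
```

In this sketch the student only has to output BOTTLENECK_DIM values per frame rather than the full teacher dimension, which is one plausible reading of how a dimensionality-reduced target lets a very small encoder be distilled.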