Self-supervised speech representation learning (S3RL) is revolutionizing the way we leverage the ever-growing availability of data. While S3RL-related studies typically use large models, we employ light-weight networks to comply with the tight memory budgets of compute-constrained devices. We demonstrate the effectiveness of S3RL on a keyword-spotting (KS) problem by using transformers with 330k parameters, and propose a mechanism to enhance utterance-wise distinction, which proves crucial for improving performance on classification tasks. On the Google speech commands v2 dataset, the proposed method applied to the Auto-Regressive Predictive Coding S3RL led to a 1.2% accuracy improvement compared to training from scratch. On an in-house KS dataset with four different keywords, it provided 6% to 23.7% relative false-accept improvement at a fixed false-reject rate. We argue this demonstrates the applicability of S3RL approaches to light-weight models for KS, and confirms S3RL is a powerful alternative to traditional supervised learning for resource-constrained applications.