Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data that are useful for speech recognition. Since these representations are learned without any task-specific supervision, they can also be useful for other voice-activated tasks like speaker verification, keyword spotting, and emotion classification. In our work, we propose a general-purpose framework for adapting a pre-trained wav2vec 2.0 model to different voice-activated tasks. We develop downstream network architectures that operate on the contextualized speech representations of wav2vec 2.0 to adapt them for solving a given task. Finally, we extend our framework to multi-task learning by jointly optimizing the network parameters on multiple voice-activated tasks using a shared transformer backbone. Both our single-task and multi-task frameworks achieve state-of-the-art results on speaker verification and keyword spotting benchmarks. Our best-performing models achieve 1.98% and 3.15% EER on the VoxCeleb1 test set when trained on VoxCeleb2 and VoxCeleb1, respectively, and 98.23% accuracy on the Google Speech Commands v1.0 keyword spotting dataset.
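To make the idea of task-specific downstream heads operating on wav2vec 2.0's contextualized representations concrete, the following is a minimal PyTorch sketch. It assumes the Hugging Face `transformers` `Wav2Vec2Model` with the `facebook/wav2vec2-base` checkpoint as the shared backbone; the mean-pooling heads, class count, and embedding size are illustrative placeholders, not the downstream architectures proposed in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model


class Wav2Vec2WithTaskHeads(nn.Module):
    """A shared pre-trained wav2vec 2.0 backbone with lightweight,
    task-specific heads (hypothetical head design for illustration)."""

    def __init__(self, num_keywords=12, speaker_embed_dim=256,
                 backbone_name="facebook/wav2vec2-base"):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        # Keyword-spotting head: pool contextual frames, then classify.
        self.kws_head = nn.Linear(hidden, num_keywords)
        # Speaker-verification head: project pooled frames to a fixed-size embedding.
        self.spk_head = nn.Linear(hidden, speaker_embed_dim)

    def forward(self, waveform, task="kws"):
        # waveform: (batch, samples) of raw 16 kHz audio.
        frames = self.backbone(waveform).last_hidden_state  # (batch, T, hidden)
        pooled = frames.mean(dim=1)                          # simple mean pooling over time
        if task == "kws":
            return self.kws_head(pooled)                     # keyword logits
        return F.normalize(self.spk_head(pooled), dim=-1)    # unit-norm speaker embedding


# Usage sketch with dummy audio.
model = Wav2Vec2WithTaskHeads()
audio = torch.randn(2, 16000)            # two 1-second clips at 16 kHz
keyword_logits = model(audio, task="kws")
speaker_embeddings = model(audio, task="spk")
```

In a multi-task setting along these lines, the backbone parameters would be shared and both heads optimized jointly, whereas the single-task setting trains one head at a time.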