We study multi-task learning for two orthogonal speech technology tasks: speech recognition and speaker recognition. We use wav2vec2 as a base architecture with two task-specific output heads. We experiment with different ways of mixing speaker and speech information in the output embedding sequence, and propose a simple dynamic approach to balancing the speech and speaker recognition loss functions. Our multi-task learning networks can produce a shared speaker and speech embedding, which we evaluate on the LibriSpeech and VoxCeleb test sets, achieving performance comparable to separate single-task models. Code is available at https://github.com/nikvaessen/2022-repo-mt-w2v2.
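As a minimal sketch of what dynamic loss balancing between the two tasks might look like, the snippet below scales each task loss by the inverse of its running average so that both tasks contribute comparably regardless of their natural magnitudes. This is an illustrative assumption, not the paper's exact scheme; the class name `DynamicLossBalancer` and the momentum-based averaging are hypothetical.

```python
# Hedged sketch of dynamic two-task loss balancing (illustrative only;
# the abstract does not specify the paper's actual method).
class DynamicLossBalancer:
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.avg = {}  # running average of each task's loss

    def combine(self, losses):
        """Return a combined loss where each task is scaled by the
        inverse of its running-average magnitude."""
        total = 0.0
        for task, loss in losses.items():
            prev = self.avg.get(task, loss)
            self.avg[task] = self.momentum * prev + (1 - self.momentum) * loss
            total += loss / self.avg[task]  # scale-invariant contribution
        return total

balancer = DynamicLossBalancer()
# A speech (e.g. CTC) loss is often much larger than a speaker loss;
# on the first step each term normalizes to 1.0, so the total is 2.0.
combined = balancer.combine({"speech": 120.0, "speaker": 3.0})
```

Because each term is divided by its own running average, neither task dominates purely due to loss scale, which is one common motivation for dynamic weighting in multi-task setups.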