Recent breakthroughs in deep learning often rely on representation learning and knowledge transfer. In recent years, unsupervised and self-supervised techniques for learning speech representations have been developed to foster automatic speech recognition. To date, most of these approaches are task-specific and designed for within-task transfer learning between different datasets or setups of a particular task. In contrast, learning task-independent representations of speech and applying transfer learning across tasks remain less common. Here, we introduce an encoder that captures word-level representations of speech for cross-task transfer learning. We demonstrate the application of the pre-trained encoder in four distinct speech and audio processing tasks: (i) speech enhancement, (ii) language identification, (iii) speech, noise, and music classification, and (iv) speaker identification. In each task, we compare the performance of our cross-task transfer learning approach against task-specific baselines. Our results show that the speech representation captured by the encoder through pre-training is transferable across distinct speech processing tasks and datasets. Notably, depending on the task, even simple applications of the pre-trained encoder outperformed the task-specific methods or matched their performance.