We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.
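The pseudo-language induction described above can be sketched at a high level: quantize frame-level speech features into discrete IDs, collapse consecutive repeats into a compact sequence, then learn subword merges over the IDs. The sketch below is illustrative only and uses random stand-in features, a toy nearest-centroid quantizer, and a minimal BPE-style merge loop; the centroid count, merge count, and all helper names are assumptions, not the paper's actual configuration.

```python
# Minimal sketch of pseudo-language induction (not the paper's exact
# method): quantize frames, deduplicate, then learn pairwise merges.
import random
from collections import Counter
from itertools import groupby

random.seed(0)

# Stand-in for frame-level features from a pre-trained speech encoder;
# real features would come from a self-supervised model.
features = [[random.gauss(0, 1) for _ in range(16)] for _ in range(200)]
# Toy codebook standing in for learned k-means centroids (assumed K=8).
centroids = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

def nearest(frame):
    """Assign a frame to its nearest centroid (squared Euclidean)."""
    return min(range(len(centroids)),
               key=lambda c: sum((f - m) ** 2
                                 for f, m in zip(frame, centroids[c])))

# Step 1: discretize each frame into a pseudo-phoneme ID.
ids = [nearest(fr) for fr in features]

# Step 2: collapse consecutive repeated IDs to compact the sequence.
dedup = [k for k, _ in groupby(ids)]

# Step 3: greedily merge the most frequent adjacent ID pairs into
# pseudo subwords (a BPE-style loop); n_merges is illustrative.
def learn_merges(seq, n_merges=5):
    seq = [str(t) for t in seq]
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + "+" + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

subwords, merges = learn_merges(dedup)
# Each stage is no longer than the previous one.
print(len(ids), len(dedup), len(subwords))
```

The resulting pseudo subword sequences would then serve as transcription targets for the self-supervised pseudo speech recognition task.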