Attention-based sequence-to-sequence modeling provides a powerful and elegant solution for applications that need to map one sequence to a different sequence. Its success, however, relies heavily on the availability of large amounts of training data. This presents a challenge for speech applications, such as automatic speech recognition (ASR) and speech translation (ST), where labelled speech data is very expensive to obtain. In this study, we propose a general multi-task learning framework to leverage text data for ASR and ST tasks. Two auxiliary tasks, a denoising autoencoder task and a machine translation task, are co-trained with the ASR and ST tasks, respectively. We demonstrate that representing text input as phoneme sequences can reduce the difference between the speech and text modalities and thus enhance knowledge transfer from text corpora to the speech-to-text tasks. Our experiments show that the proposed method achieves a relative word error rate reduction of 10~15% on the English LibriSpeech task compared with our baseline, and improves speech translation quality on the MuST-C tasks by 3.6~9.2 BLEU.
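To make the co-training setup concrete, the sketch below shows one possible joint training step, assuming a shared encoder-decoder that accepts either speech features or phoneme-sequence text input and returns token logits. The interface (`model`, `modality=...`, the batch fields, and the auxiliary weight `lam`) is hypothetical and only illustrates the weighted-sum objective; the paper specifies the actual architecture and task pairing.

```python
import torch.nn.functional as F

def multitask_step(model, optimizer, speech_batch, text_batch, lam=0.5):
    """One hypothetical joint update: primary speech task plus text auxiliary task."""
    # Primary speech-to-text loss (ASR or ST): speech features in, target tokens out.
    speech_logits = model(speech_batch.features, modality="speech")
    speech_loss = F.cross_entropy(
        speech_logits.flatten(0, 1), speech_batch.targets.flatten()
    )

    # Auxiliary text loss (denoising autoencoding for ASR, MT for ST).
    # Text input is fed as a phoneme sequence to narrow the modality gap
    # between speech and text, as proposed in the abstract above.
    text_logits = model(text_batch.phonemes, modality="text")
    text_loss = F.cross_entropy(
        text_logits.flatten(0, 1), text_batch.targets.flatten()
    )

    # Weighted sum of the two objectives; `lam` balances how strongly the
    # text corpus influences the shared parameters (value here is illustrative).
    loss = speech_loss + lam * text_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because both tasks share the encoder-decoder parameters, gradients from the abundant text data regularize the representations used by the data-scarce speech task, which is the intended knowledge-transfer mechanism.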