Attention-based sequence-to-sequence modeling provides a powerful and elegant solution for applications that need to map one sequence to another. However, its success relies heavily on the availability of large amounts of training data. This presents a challenge for speech applications such as automatic speech recognition (ASR) and speech translation (ST), where labeled speech data is very expensive to obtain. In this study, we propose a general multi-task learning framework to leverage text data for ASR and ST tasks. Two auxiliary tasks, a denoising autoencoder task and a machine translation task, are co-trained with the ASR and ST tasks, respectively. We demonstrate that representing the text input as phoneme sequences reduces the difference between speech and text inputs and enhances knowledge transfer from text corpora to the speech-to-text tasks. Our experiments show that the proposed method achieves a relative 10~15% word error rate reduction on the English LibriSpeech task and improves speech translation quality on the MuST-C tasks by 4.2~11.1 BLEU.
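To make the multi-task setup concrete, the following is a minimal PyTorch sketch of the co-training idea described above, not the paper's actual implementation: a speech encoder and a phoneme encoder feed a shared decoder, and the auxiliary text task (denoising autoencoding for ASR, machine translation for ST) contributes a weighted term to the joint loss. All module names, dimensions, vocabulary sizes, and the `aux_weight` parameter are illustrative assumptions.

```python
import torch
import torch.nn as nn

PHONEME_VOCAB = 72   # assumed phoneme inventory size
TEXT_VOCAB = 10000   # assumed output (subword) vocabulary size
D_MODEL = 256        # assumed model dimension

class SpeechEncoder(nn.Module):
    """Maps log-mel features (B x T x 80) to hidden states (B x T x D_MODEL)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, D_MODEL)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True),
            num_layers=2)
    def forward(self, feats):
        return self.encoder(self.proj(feats))

class PhonemeEncoder(nn.Module):
    """Maps phoneme token ids to hidden states in the same space, so text
    and speech inputs look similar to the shared decoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(PHONEME_VOCAB, D_MODEL)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True),
            num_layers=2)
    def forward(self, phonemes):
        return self.encoder(self.embed(phonemes))

# The decoder and output projection are shared between the speech task and
# the auxiliary text task, so gradients from abundant text data update the
# same parameters used at speech inference time.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True),
    num_layers=2)
out_proj = nn.Linear(D_MODEL, TEXT_VOCAB)
tgt_embed = nn.Embedding(TEXT_VOCAB, D_MODEL)
speech_enc, text_enc = SpeechEncoder(), PhonemeEncoder()
ce = nn.CrossEntropyLoss()

def step(speech_feats, speech_tgt, phonemes, text_tgt, aux_weight=1.0):
    """One multi-task step: L = L_speech + aux_weight * L_text."""
    def decode(memory, tgt):
        h = decoder(tgt_embed(tgt[:, :-1]), memory)   # teacher forcing
        logits = out_proj(h)
        return ce(logits.reshape(-1, TEXT_VOCAB), tgt[:, 1:].reshape(-1))
    loss_speech = decode(speech_enc(speech_feats), speech_tgt)  # ASR or ST
    loss_text = decode(text_enc(phonemes), text_tgt)            # DAE or MT
    return loss_speech + aux_weight * loss_text

# Smoke test with random tensors (batch=2, 50 speech frames, 20 phonemes).
loss = step(torch.randn(2, 50, 80),
            torch.randint(0, TEXT_VOCAB, (2, 12)),
            torch.randint(0, PHONEME_VOCAB, (2, 20)),
            torch.randint(0, TEXT_VOCAB, (2, 12)))
loss.backward()
```

For the denoising autoencoder variant, `phonemes` would be a corrupted version of the target utterance's phoneme sequence and `text_tgt` the clean transcript; for the MT variant, `phonemes` would come from the source-language text and `text_tgt` from the target-language translation.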