Training end-to-end speech translation (ST) systems requires large-scale data, which is unavailable for most language pairs and domains. One practical solution to this data scarcity is to convert machine translation (MT) data to ST data via text-to-speech (TTS) systems. Yet using TTS systems can be tedious and slow, as the conversion must be repeated for each MT dataset. In this work, we propose SpokenVocab, a simple, scalable and effective data augmentation technique that converts MT data to ST data on-the-fly. The idea is to retrieve audio snippets from a SpokenVocab bank according to the words in an MT sequence and stitch them together. Our experiments on multiple language pairs from MuST-C show that this method outperforms strong baselines by an average of 1.83 BLEU points, and that it performs as well as TTS-generated speech. We also showcase how SpokenVocab can be applied to code-switching ST, for which TTS systems often do not exist. Our code is available at https://github.com/mingzi151/SpokenVocab
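The retrieve-and-stitch idea can be illustrated with a minimal sketch. This is not the released implementation: the toy bank of random waveforms, the function names `build_toy_bank` and `stitch`, the `<unk>` fallback for out-of-vocabulary words, and the whitespace tokenisation are all assumptions made for illustration. A real bank would hold TTS-synthesised snippets, one or more per vocabulary word.

```python
import random
from typing import Dict, List, Sequence

Waveform = List[float]  # mono samples in [-1, 1]


def build_toy_bank(vocab: Sequence[str], sample_rate: int = 16000) -> Dict[str, List[Waveform]]:
    """Hypothetical stand-in for a SpokenVocab bank.

    Each word maps to a list of snippets. Here the snippets are random
    noise whose length grows with the word length (10 ms per character),
    purely so the example runs without any audio files.
    """
    rng = random.Random(0)
    samples_per_char = int(0.01 * sample_rate)
    bank: Dict[str, List[Waveform]] = {}
    for word in vocab:
        n_samples = samples_per_char * len(word)
        bank[word] = [[rng.uniform(-1.0, 1.0) for _ in range(n_samples)]]
    return bank


def stitch(sentence: str, bank: Dict[str, List[Waveform]],
           rng: random.Random, unk: str = "<unk>") -> Waveform:
    """Retrieve one snippet per word and concatenate them in order.

    Assumption: unknown words fall back to a dedicated `unk` entry.
    """
    audio: Waveform = []
    for token in sentence.lower().split():
        snippets = bank.get(token) or bank[unk]
        audio.extend(rng.choice(snippets))  # pick one snippet if several exist
    return audio


# Build the bank once, then convert MT source sentences on the fly.
bank = build_toy_bank(["<unk>", "hello", "world"])
speech = stitch("Hello world", bank, random.Random(0))
```

Because the bank is built once and lookups are simple dictionary accesses, the stitched audio can be produced inside the data loader for each MT sentence, rather than synthesising every dataset offline.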