End-to-end Speech Translation (ST) aims at translating source language speech into target language text without generating intermediate transcriptions. However, the training of end-to-end methods relies on parallel ST data, which are difficult and expensive to obtain. Fortunately, supervised data for automatic speech recognition (ASR) and machine translation (MT) are usually more accessible, making zero-shot speech translation a potential direction. Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space, resulting in much worse performance compared to supervised ST methods. In order to enable zero-shot ST, we propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text. Specifically, we introduce a vector quantization module to discretize the continuous representations of speech and text into a finite set of virtual tokens, and use ASR data to map corresponding speech and text to the same virtual token in a shared codebook. This way, source language speech can be embedded in the same semantic space as the source language text, which can then be transformed into target language text with an MT module. Experiments on multiple language pairs demonstrate that our zero-shot ST method significantly improves over the SOTA, and even performs on par with strong supervised ST baselines.
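The abstract does not spell out the quantizer's mechanics, so the following is only a minimal, hedged sketch of how such a vector quantization module is commonly realized (VQ-VAE style, with a straight-through gradient estimator). The class name, hyperparameters, and loss weighting are illustrative assumptions, not the paper's actual implementation; the key idea shown is that continuous speech or text representations are snapped to their nearest entry in a single shared codebook of "virtual tokens".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Illustrative sketch: maps continuous encoder outputs (speech or text)
    to the nearest entry of a shared, finite codebook of virtual tokens.
    Hyperparameters below are assumptions, not the paper's settings."""

    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        # one codebook shared by both modalities
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z: torch.Tensor):
        # z: (batch, length, dim) continuous representations
        flat = z.reshape(-1, z.size(-1))  # (B*L, D)
        # squared L2 distance from each vector to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))  # (B*L, K)
        indices = dist.argmin(dim=-1)  # discrete "virtual token" ids
        quantized = self.codebook(indices).view_as(z)
        # VQ-VAE-style codebook + commitment losses
        loss = (F.mse_loss(quantized, z.detach())
                + self.beta * F.mse_loss(z, quantized.detach()))
        # straight-through estimator: gradients flow from quantized back to z
        quantized = z + (quantized - z).detach()
        return quantized, indices.view(z.shape[:-1]), loss
```

Under this sketch, the same `VectorQuantizer` instance would sit on top of both the speech encoder and the text encoder; ASR pairs would then be used, via some alignment objective, to push corresponding speech frames and text tokens toward the same codebook index, so that the downstream MT module sees a modality-agnostic sequence of virtual tokens.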