The gap between speech and text modalities is a major challenge in speech-to-text translation (ST). Different methods have been proposed for reducing this gap, but most of them require architectural changes in ST training. In this work, we propose to mitigate this issue at the pre-training stage, requiring no change in the ST model. First, we show that the connectionist temporal classification (CTC) loss can reduce the modality gap by design. We provide a quantitative comparison with the more common cross-entropy loss, showing that pre-training with CTC consistently achieves better final ST accuracy. Nevertheless, CTC is only a partial solution and thus, in our second contribution, we propose a novel pre-training method combining CTC and optimal transport to further reduce this gap. Our method pre-trains a Siamese-like model composed of two encoders, one for acoustic inputs and the other for textual inputs, such that they produce representations that are close to each other in the Wasserstein space. Extensive experiments on the standard CoVoST-2 and MuST-C datasets show that our pre-training method applied to the vanilla encoder-decoder Transformer achieves state-of-the-art performance under the no-external-data setting, and performs on par with recent strong multi-task learning systems trained with external data. Finally, our method can also be applied on top of these multi-task systems, leading to further improvements for these models.
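To make the described pre-training objective concrete, below is a minimal PyTorch sketch of the general idea: a speech encoder is trained with a CTC loss over transcript tokens while an entropy-regularized (Sinkhorn) approximation of the Wasserstein distance pulls its representations toward those of a text encoder. All names here (`SiamesePretrainer`, `sinkhorn_wasserstein`, `ot_weight`), the blank index, and the loss weighting are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def sinkhorn_wasserstein(x, y, eps=0.1, n_iters=50):
    """Entropy-regularized Wasserstein cost between two embedding sequences
    x: (S, d) and y: (T, d), computed with Sinkhorn fixed-point iterations."""
    cost = torch.cdist(x, y, p=2) ** 2                    # (S, T) squared L2 cost
    a = torch.full((x.size(0),), 1.0 / x.size(0), device=x.device)  # uniform marginals
    b = torch.full((y.size(0),), 1.0 / y.size(0), device=y.device)
    K = torch.exp(-cost / eps)                            # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):                              # Sinkhorn updates
        v = b / (K.t() @ u)
        u = a / (K @ v)
    transport = u.unsqueeze(1) * K * v.unsqueeze(0)       # approximate transport plan
    return (transport * cost).sum()                       # <P, C>

class SiamesePretrainer(nn.Module):
    """Joint CTC + OT pre-training of a speech encoder and a text encoder
    (hypothetical sketch; module names and weighting are assumptions)."""
    def __init__(self, speech_encoder, text_encoder, d_model, vocab_size, ot_weight=1.0):
        super().__init__()
        self.speech_encoder = speech_encoder              # acoustic features -> (S, B, d_model)
        self.text_encoder = text_encoder                  # token ids         -> (T, B, d_model)
        self.ctc_proj = nn.Linear(d_model, vocab_size)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # assumes index 0 is blank
        self.ot_weight = ot_weight

    def forward(self, speech, speech_lens, tokens, token_lens):
        h_speech = self.speech_encoder(speech)            # (S, B, d)
        h_text = self.text_encoder(tokens)                # (T, B, d)
        log_probs = self.ctc_proj(h_speech).log_softmax(-1)
        # CTC term: align speech-encoder outputs with the transcript tokens.
        loss_ctc = self.ctc_loss(log_probs, tokens.t(), speech_lens, token_lens)
        # OT term: pull the two encoders' outputs together in Wasserstein space.
        loss_ot = sum(
            sinkhorn_wasserstein(h_speech[: speech_lens[b], b], h_text[: token_lens[b], b])
            for b in range(speech.size(1))
        ) / speech.size(1)
        return loss_ctc + self.ot_weight * loss_ot
```

The Sinkhorn iterations give a differentiable approximation of the Wasserstein distance, so the optimal-transport term can be back-propagated through both encoders alongside the CTC loss during pre-training.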