The end-to-end speech translation (E2E-ST) model has gradually become a mainstream paradigm due to its low latency and reduced error propagation. However, training such a model well is non-trivial because of the task complexity and data scarcity. The modality gap between speech and text usually leaves E2E-ST performance inferior to that of the corresponding machine translation (MT) model. Based on this observation, existing methods often use sharing mechanisms to carry out implicit knowledge transfer by imposing various constraints. However, the final model often performs worse on the MT task than an MT model trained alone, which means the knowledge transfer ability of such methods is also limited. To deal with these problems, we propose FCCL (Fine- and Coarse-Granularity Contrastive Learning) for E2E-ST, which performs explicit knowledge transfer through cross-modal multi-grained contrastive learning. A key ingredient of our approach is applying contrastive learning at both the sentence and frame level to provide comprehensive guidance for extracting speech representations that contain rich semantic information. In addition, we adopt a simple whitening method to alleviate the representation degeneration in the MT model, which adversely affects contrastive learning. Experiments on the MuST-C benchmark show that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines on all eight language pairs. Further analysis indicates that FCCL can free up its capacity from learning grammatical structure information and force more layers to learn semantic information.
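As a rough illustration of the multi-grained objective described above, the sketch below implements InfoNCE-style contrastive losses at the sentence level (pooled utterance vs. pooled transcript, with in-batch negatives) and at the frame level (each speech frame vs. its aligned text token). The function names, the pooled inputs, and the `frame2token` alignment are assumptions for illustration only; the abstract does not specify how FCCL forms its positive pairs.

```python
# Minimal sketch, NOT the paper's exact formulation: InfoNCE-style
# contrastive losses at coarse (sentence) and fine (frame) granularity.
import torch
import torch.nn.functional as F

def sentence_level_loss(speech_emb, text_emb, temperature=0.1):
    """Coarse-grained loss: each pooled speech embedding should match its
    paired text embedding; other sentences in the batch act as negatives.
    speech_emb, text_emb: (batch, dim)."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

def frame_level_loss(frame_emb, token_emb, frame2token, temperature=0.1):
    """Fine-grained loss within one utterance: each speech frame is pulled
    toward its aligned text token and pushed away from the other tokens.
    frame_emb: (T, dim); token_emb: (L, dim); frame2token: (T,) long tensor
    giving the aligned token index per frame (a hypothetical alignment;
    the abstract does not say how frame/token pairs are obtained)."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    token_emb = F.normalize(token_emb, dim=-1)
    logits = frame_emb @ token_emb.t() / temperature   # (T, L)
    return F.cross_entropy(logits, frame2token)
```

One plausible training setup adds these terms, with some weighting, to the standard ST cross-entropy objective, with the text-side representations coming from the MT model that acts as the teacher.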
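The whitening step mentioned above can be illustrated with the standard transform that maps a set of embeddings to zero mean and identity covariance, counteracting the anisotropy ("representation degeneration") of MT encoder outputs. Whether FCCL uses exactly this SVD-based variant is an assumption; the sketch follows the common whitening recipe for sentence representations.

```python
import torch

def whiten(embeddings, eps=1e-6):
    """Map (N, dim) embeddings to zero mean and identity covariance.
    SVD-based whitening; an assumed variant, the paper may differ."""
    mu = embeddings.mean(dim=0, keepdim=True)
    centered = embeddings - mu
    cov = centered.t() @ centered / (embeddings.size(0) - 1)
    U, S, _ = torch.linalg.svd(cov)            # cov = U diag(S) U^T
    W = U @ torch.diag((S + eps).rsqrt())      # whitening matrix
    return centered @ W
```

Applying such a transform to the MT-side embeddings before computing the contrastive loss makes cosine similarities more discriminative, since degenerate, highly anisotropic representations otherwise cluster in a narrow cone and weaken the contrastive signal.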