End-to-end Speech Translation (E2E ST) aims to translate source speech directly into the target language without generating an intermediate transcript. However, existing approaches to E2E ST degrade considerably when only limited ST data are available. We observe that an ST model's performance correlates strongly with the similarity between its speech and transcript embeddings. In this paper, we propose Word-Aligned COntrastive learning (WACO), a novel method for few-shot speech-to-text translation. Our key idea is to bridge word-level representations of the two modalities via contrastive learning. We evaluate WACO and other methods on MuST-C, a widely used ST benchmark. Our experiments demonstrate that WACO outperforms the best baseline methods by 0.7-8.5 BLEU points with only 1 hour of parallel ST data. Code is available at https://anonymous.4open.science/r/WACO .
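The core idea of bridging word-level representations across modalities can be illustrated with a minimal sketch. The helper names, the frame-averaging pooling, and the use of an InfoNCE-style objective over matched word pairs are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def word_embeddings(frame_feats, word_spans):
    """Pool frame-level speech features into word-level embeddings.

    frame_feats: (num_frames, dim) array of encoder outputs.
    word_spans:  list of (start, end) frame indices, one per word,
                 assumed to come from a forced alignment (hypothetical input).
    """
    return np.stack([frame_feats[s:e].mean(axis=0) for s, e in word_spans])

def word_contrastive_loss(speech_words, text_words, temperature=0.1):
    """InfoNCE-style loss pulling each speech word embedding toward its
    matching transcript word embedding (diagonal) and pushing it away
    from the other words in the batch (off-diagonal)."""
    s = speech_words / np.linalg.norm(speech_words, axis=1, keepdims=True)
    t = text_words / np.linalg.norm(text_words, axis=1, keepdims=True)
    logits = s @ t.T / temperature
    # numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # matched pairs sit on the diagonal
    return -np.mean(np.diag(log_probs))
```

When the two modalities' word embeddings agree, the loss is near zero; mismatched pairings are penalized, which is what drives the speech and transcript representations together.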