How can we learn unified representations for spoken utterances and their written text? Learning similar representations for semantically similar speech and text is important for speech translation. To this end, we propose ConST, a cross-modal contrastive learning method for end-to-end speech-to-text translation. We evaluate ConST and a variety of previous baselines on the popular MuST-C benchmark. Experiments show that the proposed ConST consistently outperforms the previous methods, achieving an average BLEU of 29.4. The analysis further verifies that ConST indeed closes the representation gap between the two modalities: its learned representations improve the accuracy of cross-modal speech-text retrieval from 4% to 88%. Code and models are available at https://github.com/ReneeYe/ConST.
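To make the cross-modal contrastive idea concrete, below is a minimal sketch of an InfoNCE-style objective that pulls paired speech and text representations together while pushing apart unpaired ones within a batch. The function name, the use of mean-pooled utterance/sentence embeddings, and the temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_repr, text_repr, temperature=0.1):
    """InfoNCE-style loss over paired speech/text embeddings.

    speech_repr, text_repr: (batch, dim) pooled utterance/sentence embeddings,
    where row i of each tensor comes from the same (speech, transcript) pair.
    All other rows in the batch serve as negatives (assumed batching scheme).
    """
    speech = F.normalize(speech_repr, dim=-1)
    text = F.normalize(text_repr, dim=-1)
    logits = speech @ text.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: speech-to-text and text-to-speech retrieval directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

In practice such a loss would be added to the standard translation objective during training; minimizing it encourages semantically matched speech and text to map to nearby points, which is consistent with the retrieval-accuracy gains reported above.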