Multimedia communication facilitates global interaction among people. Although researchers have explored cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, cross-lingual studies of visual speech remain scarce. This gap is mainly due to the absence of datasets that pair visual speech with translated text. In this paper, we present \textbf{AVMuST-TED}, the first dataset for \textbf{A}udio-\textbf{V}isual \textbf{Mu}ltilingual \textbf{S}peech \textbf{T}ranslation, derived from \textbf{TED} talks. However, visual speech is less distinguishable than audio speech, making it difficult to learn a mapping from source-language phonemes to target-language text. To address this issue, we propose MixSpeech, a cross-modality self-learning framework that uses audio speech to regularize the training of visual speech tasks. To further reduce the cross-modality gap and its impact on knowledge transfer, we introduce mixed speech, created by interpolating audio and visual streams, together with a curriculum learning strategy that adjusts the mixing ratio during training. MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2. Moreover, it achieves state-of-the-art performance in lip reading on CMLR (11.1\%), LRS2 (25.5\%), and LRS3 (28.0\%).
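To make the mixing idea concrete, the sketch below shows one way the interpolation of audio and visual streams with a curriculum-scheduled ratio could be implemented. It is a minimal illustration only: the assumption of frame-aligned encoder embeddings, the linear annealing schedule, and all function names (\texttt{curriculum\_mix\_ratio}, \texttt{mix\_speech}) are hypothetical and do not reproduce the paper's exact recipe.
\begin{verbatim}
import torch

def curriculum_mix_ratio(step: int, total_steps: int,
                         start: float = 1.0, end: float = 0.0) -> float:
    # Linearly anneal the audio weight from `start` to `end`.
    # Early training leans on the (easier) audio stream; later training
    # relies mostly on the visual stream. The linear schedule is an
    # illustrative assumption, not necessarily the paper's schedule.
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress

def mix_speech(audio_emb: torch.Tensor, visual_emb: torch.Tensor,
               ratio: float) -> torch.Tensor:
    # Interpolate frame-aligned audio and visual embeddings of shape
    # (batch, frames, dim); `ratio` is the weight on the audio stream.
    assert audio_emb.shape == visual_emb.shape
    return ratio * audio_emb + (1.0 - ratio) * visual_emb

if __name__ == "__main__":
    # Toy example: 4 utterances, 50 aligned frames, 256-dim features.
    audio = torch.randn(4, 50, 256)
    visual = torch.randn(4, 50, 256)
    for step in (0, 5000, 10000):
        r = curriculum_mix_ratio(step, total_steps=10000)
        mixed = mix_speech(audio, visual, r)
        print(step, r, mixed.shape)
\end{verbatim}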