End-to-End Speech Translation (E2E-ST) has received increasing attention due to its potential for less error propagation, lower latency, and fewer parameters. However, the effectiveness of neural approaches to this task is severely limited by the available training corpus, especially for domain adaptation, where in-domain triplet training data is scarce or nonexistent. In this paper, we propose a novel non-parametric method that leverages a domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system. To this end, we first incorporate an additional encoder into the pre-trained E2E-ST model to enable text translation modelling, and then unify the decoder's output representation for the text and speech translation tasks by reducing the corresponding representation mismatch on the available triplet training data. During domain adaptation, a k-nearest-neighbor (kNN) classifier is introduced to produce the final translation distribution using an external datastore built from the domain-specific text translation corpus, while the unified output representation is used to perform the similarity search. Experiments on the Europarl-ST benchmark demonstrate that when only in-domain text translation data is involved, our proposed approach significantly improves the baseline by 12.82 BLEU on average across all translation directions, even outperforming the strong in-domain fine-tuning method.
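The retrieval step described above follows the general kNN-MT recipe: decoder output representations are used to query a datastore of (representation, target-token) pairs, the retrieved neighbors induce a distribution over the vocabulary, and that distribution is interpolated with the model's own prediction. The following is a minimal sketch of that mechanism under stated assumptions; the function names, the toy datastore, and the distance-softmax formulation are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def knn_translation_distribution(query, keys, values, vocab_size,
                                 k=2, temperature=1.0):
    """kNN distribution over the vocabulary, kNN-MT style (illustrative).

    `keys`   : datastore of decoder output representations, shape (N, d)
    `values` : target token id paired with each key, length N
    `query`  : current decoder output representation, shape (d,)
    """
    # L2 distances between the query representation and all datastore keys
    dists = np.linalg.norm(keys - query, axis=1)
    nn_idx = np.argsort(dists)[:k]
    # Turn negative distances into neighbor weights with a softmax
    logits = -dists[nn_idx] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Scatter the neighbor weights onto their stored target tokens
    p_knn = np.zeros(vocab_size)
    for w, idx in zip(weights, nn_idx):
        p_knn[values[idx]] += w
    return p_knn

def interpolate(p_model, p_knn, lam=0.5):
    # Final distribution: lam * p_kNN + (1 - lam) * p_model
    return lam * p_knn + (1.0 - lam) * p_model
```

In practice the datastore is built offline by a forward pass of the text translation branch over the in-domain corpus, and the search is done with an approximate-nearest-neighbor index rather than the exact brute-force scan shown here.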