Speech-to-Speech Translation (S2ST) converts speech in one language into semantically equivalent speech in another, enabling communication between speakers of different languages. Speech-to-Discrete Unit Translation (S2UT), a mainstream approach to end-to-end S2ST, mitigates the error propagation across modules and the slow inference that often affect traditional cascade systems. However, because discrete units primarily capture content information, conventional S2UT methods fail to retain the source speaker's characteristics. Our previous work, SC-S2UT, introduced a speaker adapter and a unit-to-mel structure that together preserve speaker information and enable non-autoregressive speech generation. Building on this foundation, this study proposes a self-supervised pretraining method to enrich the information extracted by both the speaker adapter and the unit-to-mel structure. We also investigate different feature fusion strategies to improve the integration of speaker and content features. Experiments on the CVSS-T dataset for the ES-EN and FR-EN tasks show that the proposed method improves BLEU by 1.14 over SC-S2UT, with significant gains in MOS and speaker similarity. Furthermore, our approach achieves translation quality comparable to traditional S2UT while maintaining high speaker similarity, at a cost of only 0.04 s of additional inference time per utterance. These results validate the effectiveness of the proposed method.
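The abstract mentions feature fusion strategies for combining speaker and content features but does not say which ones are compared. As a rough illustration only, the sketch below shows two common options, additive fusion and concatenation, applied to a frame-level content sequence and a single utterance-level speaker vector. The function names, shapes, and the choice of these two strategies are assumptions made for this example, not the method evaluated in the paper.

```python
from typing import List

Vector = List[float]
Frames = List[Vector]  # one feature vector per time frame


def fuse_add(content: Frames, speaker: Vector) -> Frames:
    """Additive fusion (hypothetical helper): add the speaker vector to every frame.

    Requires the content and speaker dimensions to match.
    """
    for frame in content:
        if len(frame) != len(speaker):
            raise ValueError("additive fusion needs equal feature dimensions")
    return [[c + s for c, s in zip(frame, speaker)] for frame in content]


def fuse_concat(content: Frames, speaker: Vector) -> Frames:
    """Concatenation fusion (hypothetical helper): append the speaker vector to every frame.

    The output dimension is the sum of the content and speaker dimensions.
    """
    return [frame + speaker for frame in content]


# Toy inputs: three content frames of dimension 2 and one speaker vector.
content = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
speaker = [0.5, -0.5]

added = fuse_add(content, speaker)
concatenated = fuse_concat(content, speaker)

print(len(added), len(added[0]))                # 3 2
print(len(concatenated), len(concatenated[0]))  # 3 4
```

Additive fusion keeps the feature dimension unchanged but requires the two representations to share a dimension, whereas concatenation keeps them separate at the cost of a wider input to the following layer, which would typically be followed by a learned projection.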