This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Although existing speech translation approaches achieve strong translation quality, they often overlook the transfer of such speech patterns, leading to mismatches with the source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech from the translated units and the source speaker's identity using a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that match the duration and speaking pace of the original speech while achieving competitive translation performance. The code is available at https://github.com/kaistmm/Dub-S2ST.
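To make the text-free speed adaptation and explicit duration control concrete, here is a minimal sketch, not the authors' implementation: it assumes speech has already been discretized into units at a fixed frame rate (e.g. 50 Hz, as with HuBERT-style tokenizers), and all names, signatures, and numbers below are illustrative assumptions rather than the paper's API.

```python
# Minimal sketch of unit-based speed estimation and duration control.
# Assumption: a tokenizer emits one discrete unit per frame at a fixed
# frame rate; consecutive repeated units encode how long a sound is held.

from itertools import groupby

FRAME_RATE_HZ = 50  # assumed frame rate of the unit tokenizer


def speaking_rate(units: list[int]) -> float:
    """Distinct units per second after collapsing consecutive repeats.

    Deduplication discards per-unit duration, so the count of remaining
    tokens per second serves as a text-free proxy for speaking speed.
    """
    deduped = [u for u, _ in groupby(units)]
    return len(deduped) * FRAME_RATE_HZ / len(units)


def target_unit_length(source_units: list[int], rate_ratio: float = 1.0) -> int:
    """Choose the output length so the translated speech matches the
    source duration; a duration-controlled decoder would then generate
    exactly this many unit frames."""
    return max(1, round(len(source_units) * rate_ratio))


# Example: a toy source sequence of 10 frames (0.2 s at 50 Hz).
src = [4, 4, 4, 17, 17, 9, 9, 9, 9, 2]
print(f"source rate: {speaking_rate(src):.1f} distinct units/s")
print(f"target length for time-aligned output: {target_unit_length(src)} frames")
```

Fixing the output length in frames is what ties the translated speech to the source duration, while the deduplicated-unit rate gives the model a speed signal without any transcript.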