Expressive speech-to-speech translation (S2ST) aims to transfer prosodic attributes of source speech to target speech while maintaining translation accuracy. Existing research in expressive S2ST is limited and typically focuses on a single aspect of expressivity at a time. The area also lacks standard evaluation protocols and well-curated benchmark datasets. In this work, we propose a holistic cascade system for expressive S2ST that combines multiple prosody transfer techniques previously considered only in isolation. We curate a benchmark expressivity test set in the TV series domain and explore a second dataset in the audiobook domain. Finally, we present a human evaluation protocol to assess multiple expressive dimensions across speech pairs. Experimental results indicate that bilingual annotators can assess how well S2ST systems preserve expressivity, and that the holistic modeling approach outperforms single-aspect systems. Audio samples can be accessed through our demo webpage: https://facebookresearch.github.io/speech_translation/cascade_expressive_s2st.