ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit, necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST). Each task is supported with a wide variety of approaches, which differentiates ESPnet-ST-v2 from other open-source spoken language translation toolkits. The toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design of ESPnet-ST-v2, example models for each task, and performance benchmarks. The toolkit is publicly available at https://github.com/espnet/espnet.