ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit, necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST). Each task is supported with a wide variety of approaches, which differentiates ESPnet-ST-v2 from other open-source spoken language translation toolkits. The toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design, example models for each task, and performance benchmarks for ESPnet-ST-v2, which is publicly available at https://github.com/espnet/espnet.