Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce them to streaming end-to-end speech translation (ST), which converts audio signals directly into text in another language. Compared with cascaded ST, which performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency, exploits speech information, and avoids error propagation from ASR to MT. To improve modeling capacity, we propose attention pooling for the joint network in TT. In addition, we extend TT-based ST to multilingual ST, which generates text in multiple languages at the same time. Experimental results on a large-scale training set of 50 thousand (K) hours of pseudo-labeled speech show that TT-based ST not only significantly reduces inference time but also outperforms non-streaming cascaded ST on English-German translation.
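To make the joint-network idea concrete: in a transducer, the joint network fuses an encoder (acoustic) state with a prediction-network (label) state at each alignment step, and attention pooling replaces a single encoder frame with an attention-weighted summary of a window of frames. The following is a minimal NumPy sketch of this pooling step; the function name, weight matrices, and the additive fusion are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention_pool(enc_frames, pred_state, W_q, W_k, W_out):
    """Hypothetical attention-pooled joint step for a transducer.

    enc_frames: (T, d) window of encoder outputs
    pred_state: (d,)   prediction-network output at label step u
    W_q, W_k, W_out: (d, d) illustrative learned projections
    """
    q = pred_state @ W_q                       # query from the label side
    k = enc_frames @ W_k                       # keys from the acoustic frames
    scores = k @ q / np.sqrt(q.shape[-1])      # scaled dot-product scores, (T,)
    alpha = softmax(scores)                    # attention weights over frames
    pooled = alpha @ enc_frames                # pooled acoustic vector, (d,)
    # Fuse pooled acoustic and label states (additive fusion assumed here);
    # a real model would follow this with a projection to the vocabulary.
    return np.tanh((pooled + pred_state) @ W_out)
```

The pooled vector plays the role that a single encoder frame plays in a standard transducer joint network, letting each label step attend over several acoustic frames.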