The multi-decoder (MD) end-to-end speech translation model has demonstrated high translation quality by searching for better intermediate automatic speech recognition (ASR) decoder states as hidden intermediates (HI). It is a two-pass decoding model that decomposes the overall task into ASR and machine translation sub-tasks. However, its decoding speed is not fast enough for real-world applications because it conducts beam search for both sub-tasks during inference. We propose Fast-MD, a fast MD model that generates HI by non-autoregressive (NAR) decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder. We investigate two types of NAR HI: (1) parallel HI, using an autoregressive Transformer ASR decoder, and (2) masked HI, using Mask-CTC, which combines CTC and the conditional masked language model. To reduce the mismatch in the ASR decoder between teacher forcing during training and conditioning on CTC outputs during testing, we also propose sampling CTC outputs during training. Experimental evaluations on three corpora show that Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality. Adopting the Conformer encoder and an intermediate CTC loss further boosts quality without sacrificing decoding speed.
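The core speed-up comes from replacing beam search in the ASR stage with a single parallel decoder pass conditioned on greedy CTC outputs. Below is a minimal sketch (not the authors' implementation) of this "parallel HI" idea in PyTorch; the class `ParallelHIGenerator`, its module names, and all dimensions are hypothetical, and the real Fast-MD model feeds the resulting hidden intermediates to an MT decoder.

```python
# Hypothetical sketch of generating hidden intermediates (HI) non-autoregressively:
# greedy CTC decoding over encoder states, then one teacher-forcing-style pass
# through the ASR decoder instead of beam search.
import torch
import torch.nn as nn


class ParallelHIGenerator(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2, blank_id=0):
        super().__init__()
        self.blank_id = blank_id
        self.ctc_head = nn.Linear(d_model, vocab_size)       # CTC projection over encoder frames
        self.embed = nn.Embedding(vocab_size, d_model)        # token embedding for the ASR decoder
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.asr_decoder = nn.TransformerDecoder(dec_layer, num_layers)

    def ctc_greedy(self, enc_out):
        """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks."""
        frame_ids = self.ctc_head(enc_out).argmax(dim=-1)     # (B, T)
        hyps = []
        for seq in frame_ids:
            collapsed, prev = [], None
            for t in seq.tolist():
                if t != prev and t != self.blank_id:
                    collapsed.append(t)
                prev = t
            hyps.append(torch.tensor(collapsed or [self.blank_id]))
        return nn.utils.rnn.pad_sequence(hyps, batch_first=True,
                                         padding_value=self.blank_id)

    def forward(self, enc_out):
        """One parallel (non-autoregressive) pass producing hidden intermediates."""
        ctc_hyp = self.ctc_greedy(enc_out).to(enc_out.device)          # (B, L)
        causal = nn.Transformer.generate_square_subsequent_mask(ctc_hyp.size(1))
        hi = self.asr_decoder(self.embed(ctc_hyp), enc_out,
                              tgt_mask=causal.to(enc_out.device))      # (B, L, d_model)
        return hi                                                      # passed on to the MT decoder


# Toy usage: 2 utterances, 50 encoder frames, vocabulary of 100 tokens.
enc_out = torch.randn(2, 50, 256)
hi = ParallelHIGenerator(vocab_size=100)(enc_out)
print(hi.shape)  # torch.Size([2, L, 256])
```

Because the decoder sees the full CTC hypothesis at once, the cost of the ASR stage drops from L sequential beam-search steps to a single batched pass; the proposed CTC-output sampling during training is meant to make the decoder robust to errors in that hypothesis.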