Fast inference is an important goal for real-world deployment of speech translation (ST) systems. End-to-end (E2E) models based on the encoder-decoder architecture are better suited to this goal than traditional cascaded systems, but their decoding speed has not been thoroughly explored so far. Inspired by recent progress in non-autoregressive (NAR) methods for text-based translation, which generate target tokens in parallel by removing conditional dependencies among them, we study NAR decoding for E2E-ST. We propose a novel NAR E2E-ST framework, Orthoros, in which NAR and autoregressive (AR) decoders are jointly trained on top of a shared speech encoder. The AR decoder selects the best translation among candidates of various lengths generated by the NAR decoder, which dramatically improves the effectiveness of a large length beam with negligible overhead. We further investigate effective methods for predicting output length from speech inputs and the impact of vocabulary size. Experiments on four benchmarks show that the proposed method improves inference speed while maintaining translation quality competitive with state-of-the-art AR E2E-ST systems.
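To make the decoding scheme concrete, the following is a minimal sketch (not the authors' implementation) of the length-beam selection described above: the NAR decoder produces one hypothesis per candidate target length in parallel, and the jointly trained AR decoder rescores them to pick the best one. The callables `nar_decode` and `ar_score` are hypothetical stand-ins for the actual model components.

```python
# Minimal sketch of AR-rescored NAR decoding with a length beam.
# `nar_decode` and `ar_score` are hypothetical placeholders, assumed to wrap
# the NAR decoder and the AR decoder (teacher-forced scoring) respectively.

from typing import Callable, List, Sequence, Tuple


def decode_with_length_beam(
    encoder_out: Sequence[float],                  # shared speech-encoder states
    predicted_length: int,                         # target length predicted from speech
    length_beam: int,                              # number of length candidates
    nar_decode: Callable[[Sequence[float], int], List[int]],
    ar_score: Callable[[Sequence[float], List[int]], float],
) -> Tuple[List[int], int]:
    """Generate NAR hypotheses for several candidate lengths and return the
    one the AR decoder scores highest."""
    # Candidate lengths centered on the predicted length, e.g. L-2 .. L+2.
    offsets = range(-(length_beam // 2), (length_beam + 1) // 2)
    candidates = []
    for off in offsets:
        length = max(1, predicted_length + off)
        hyp = nar_decode(encoder_out, length)      # tokens generated in parallel
        candidates.append((hyp, length))
    # Rescore each candidate with the AR decoder; one teacher-forced pass per
    # hypothesis, so the added cost over plain NAR decoding stays small.
    scored = [(ar_score(encoder_out, hyp), hyp, length)
              for hyp, length in candidates]
    _, best_hyp, best_length = max(scored, key=lambda item: item[0])
    return best_hyp, best_length
```

In this sketch, increasing `length_beam` only adds cheap parallel NAR passes plus one scoring pass per candidate, which is why a large length beam remains affordable.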