The black-box nature of end-to-end speech translation (E2E ST) systems makes it difficult to understand how source language inputs are mapped to the target language. To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word. A major challenge arises from the fact that translation is a non-monotonic sequence transduction task due to word ordering differences between languages -- this clashes with the monotonic nature of ASR. Therefore, we propose to generate ST tokens out of order while remembering how to re-order them later. We achieve this by predicting a sequence of tuples, each consisting of a source word, the corresponding target words, and post-editing operations dictating the correct insertion points for those target words. We examine two variants of such operation sequences that enable the generation of monotonic transcriptions and non-monotonic translations from the same speech input simultaneously. We apply our approach to offline and real-time streaming models, demonstrating that we can provide explainable translations without sacrificing quality or latency. In fact, the delayed re-ordering ability of our approach improves performance during streaming. As an added benefit, our method performs ASR and ST simultaneously, making it faster than running two separate systems for these tasks.
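For concreteness, the sketch below illustrates the kind of tuple sequence described above. It is a minimal, hypothetical example, assuming each tuple carries an explicit insertion index into the growing target sequence (with "append" as the default); the paper's actual operation vocabulary and its two variants may differ. The function name decode_tuples and the English-German example are illustrative only.

```python
# Hypothetical sketch: each tuple is (source_word, target_words, insert_index),
# where insert_index gives the position in the target sequence at which the
# target words are spliced in (None meaning "append at the end").

from typing import List, Optional, Tuple

def decode_tuples(
    tuples: List[Tuple[str, List[str], Optional[int]]]
) -> Tuple[str, str]:
    """Rebuild a monotonic transcript and a reordered translation."""
    transcript: List[str] = []
    translation: List[str] = []
    for src_word, tgt_words, insert_index in tuples:
        transcript.append(src_word)                # ASR side stays monotonic
        pos = len(translation) if insert_index is None else insert_index
        translation[pos:pos] = tgt_words           # splice targets into place
    return " ".join(transcript), " ".join(translation)

# English source, German target: "gegessen" must end up clause-final
# ("Er hat einen Apfel gegessen"), so the later tuples carry explicit
# insertion indices instead of plain appends.
example = [
    ("He",    ["Er"],        None),
    ("has",   ["hat"],       None),
    ("eaten", ["gegessen"],  None),
    ("an",    ["einen"],     2),
    ("apple", ["Apfel"],     3),
]
print(decode_tuples(example))
# -> ('He has eaten an apple', 'Er hat einen Apfel gegessen')
```

In this toy decoding, the transcript is emitted strictly left to right while the translation is assembled out of order, which mirrors the idea of generating ST tokens early and deferring their re-ordering to later post-editing operations.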