A cascaded speech translation model relies on discrete and non-differentiable transcription, which provides a supervision signal from the source side and aids the transformation from source speech to target text. Such modeling suffers from error propagation between the ASR and MT models. Direct speech translation is an alternative that avoids error propagation; however, its performance often lags behind that of the cascade system. To use an intermediate representation while preserving end-to-end trainability, previous studies have proposed two-stage models that pass the hidden vectors of the recognizer into the decoder of the MT model, ignoring the MT encoder. This work explores the feasibility of collapsing all cascade components into a single end-to-end trainable model by optimizing the parameters of the ASR and MT models jointly, without discarding any learned parameters. It is a tightly integrated method that passes renormalized source word posterior distributions as soft decisions instead of one-hot vectors, which enables backpropagation. It therefore provides both transcriptions and translations and achieves strong consistency between them. Our experiments on four tasks with different data scenarios show that the model outperforms cascade models by up to 1.8% in BLEU and 2.0% in TER, and is superior to direct models.
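The core of the tight integration is replacing the hard, non-differentiable argmax-and-lookup interface between ASR and MT with a posterior-weighted combination of source embeddings. The following is a minimal numpy sketch of that idea; all names, sizes, and the temperature-based renormalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: source vocabulary of 5 words, embedding dim 4,
# ASR output sequence of 3 positions.
vocab, dim, length = 5, 4, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab, dim))            # MT encoder's source embedding matrix
asr_logits = rng.normal(size=(length, vocab))  # ASR decoder scores per position

# Hard (cascade) interface: argmax then one-hot lookup.
# The argmax is discrete, so no gradient flows from MT back into ASR.
hard_ids = asr_logits.argmax(axis=-1)
hard_emb = E[hard_ids]

# Tight integration: renormalized posteriors act as soft weights over the
# embedding matrix. The matrix product is differentiable, so the MT loss
# can backpropagate into the ASR parameters.
tau = 0.5                                     # sharpening temperature (assumption)
posteriors = softmax(asr_logits / tau)        # renormalized word posteriors
soft_emb = posteriors @ E                     # expected embedding per position

print(hard_emb.shape, soft_emb.shape)
```

As the temperature approaches zero the posteriors approach one-hot vectors and the soft embeddings recover the cascade's hard lookup, which is one way to view the cascade as a limiting case of this model.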