Transformers have been the dominant architecture for Speech Translation in recent years, achieving significant improvements in translation quality. Because speech signals are considerably longer than their textual counterparts and the Transformer's attention scales quadratically with input length, a down-sampling step is essential for its adoption in Speech Translation. In this research, we instead propose to ease the complexity by using a Perceiver encoder to map the speech inputs to a fixed-length latent representation. Furthermore, we introduce a novel way of training Perceivers, with Dynamic Latent Access (DLA), unlocking larger latent spaces without any additional computational overhead. Speech-to-Text Perceivers with DLA can match the performance of Transformer baselines across three language pairs in MuST-C. Finally, a DLA-trained model is easily adaptable to DLA at inference, and can be flexibly deployed with various computational budgets, without significant drops in translation quality.
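To make the core idea concrete, below is a minimal PyTorch-style sketch of a Perceiver encoder with dynamic access to its latent bank: a set of learned latent vectors cross-attends to the (arbitrarily long) speech features, and at each training step only a random subset of the latents is active, so the cost of latent self-attention does not grow with the size of the latent bank. This is an illustrative assumption of how DLA could be realized, not the authors' exact implementation; all class, parameter, and variable names here are hypothetical.

```python
import torch
import torch.nn as nn

class DLAPerceiverEncoder(nn.Module):
    """Sketch of a Perceiver encoder with Dynamic Latent Access (DLA).

    A bank of `num_latents` learned vectors cross-attends to the speech
    features; at each training step only a random subset of `num_active`
    latents is used, so per-step compute stays constant while the latent
    bank can grow. Names and hyper-parameters are illustrative.
    """

    def __init__(self, dim=512, num_latents=512, num_active=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.num_active = num_active
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(
            dim, num_heads, dim_feedforward=4 * dim, batch_first=True
        )

    def forward(self, speech_feats, num_active=None):
        # speech_feats: (batch, frames, dim) acoustic features of any length.
        k = num_active or self.num_active
        if self.training:
            # DLA: sample a random subset of k latents at every training step.
            idx = torch.randperm(self.latents.size(0), device=speech_feats.device)[:k]
        else:
            # At inference, any budgeted subset can be used (here: the first k).
            idx = torch.arange(k, device=speech_feats.device)
        lat = self.latents[idx].unsqueeze(0).expand(speech_feats.size(0), -1, -1)
        # Cross-attention from latents to speech: cost is linear in the number of frames.
        lat, _ = self.cross_attn(lat, speech_feats, speech_feats)
        # Self-attention only over the k active latents: cost independent of input length.
        return self.self_attn(lat)
```

Under this reading, the inference-time flexibility mentioned in the abstract corresponds to choosing a smaller or larger `num_active` at deployment, trading compute for translation quality without retraining.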