Although Transformers have succeeded in several speech processing tasks such as spoken language understanding (SLU) and speech translation (ST), real-world interaction still requires online processing that keeps performance competitive. In this paper, we take the first step toward streaming SLU and simultaneous ST using a blockwise streaming Transformer, which is based on contextual block processing and blockwise synchronous beam search. Furthermore, we design an automatic speech recognition (ASR)-based intermediate loss regularization for the streaming SLU task to further improve classification performance. For the simultaneous ST task, we propose a cross-lingual encoding method that employs a CTC branch optimized with target-language translations. In addition, the CTC translation output is used to refine the search space with the CTC prefix score, achieving joint CTC/attention simultaneous translation for the first time. SLU experiments are conducted on the FSC and SLURP corpora, while the ST task is evaluated on the Fisher-CallHome Spanish and MuST-C En-De corpora. Experimental results show that the blockwise streaming Transformer achieves competitive results compared to offline models; our proposed methods further yield a 2.4% accuracy gain on the SLU task and a 4.3 BLEU gain on the ST task over streaming baselines.
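To make the idea of refining the search space with a CTC prefix score concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the scoring formula, so the sketch assumes the log-linear interpolation commonly used in hybrid CTC/attention decoding. The names `joint_score` and `prune_beam`, the weight `ctc_weight = 0.3`, and all scores and tokens in the example are hypothetical. The CTC prefix score is taken as a precomputed input rather than computed here.

```python
from typing import List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float, float]  # (tokens, log_p_att, log_p_ctc)


def joint_score(log_p_att: float, log_p_ctc: float, ctc_weight: float = 0.3) -> float:
    """Interpolate attention and CTC log-probabilities (assumed formulation)."""
    return (1.0 - ctc_weight) * log_p_att + ctc_weight * log_p_ctc


def prune_beam(
    hyps: List[Hypothesis], beam_size: int = 2, ctc_weight: float = 0.3
) -> List[Tuple[float, Tuple[str, ...]]]:
    """Keep the beam_size partial translations with the highest joint score."""
    scored = [(joint_score(att, ctc, ctc_weight), tokens) for tokens, att, ctc in hyps]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:beam_size]


# Hypothetical partial hypotheses with made-up log-probabilities.
hyps = [
    (("wir", "gehen"), -1.0, -4.0),  # favored by attention, penalized by CTC
    (("wir", "geht"), -1.2, -1.5),   # slightly worse attention, much better CTC
    (("wie", "gehen"), -3.0, -3.0),
]

att_only = prune_beam(hyps, beam_size=2, ctc_weight=0.0)
joint = prune_beam(hyps, beam_size=2, ctc_weight=0.3)
print("attention only:", att_only[0][1])
print("joint:", joint[0][1])
```

In this toy example the CTC term changes which hypothesis ranks first, which is the sense in which a CTC score can narrow or reorder the search space during simultaneous decoding.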