This paper presents an in-depth study of the Sequentially Sampled Chunk Conformer (SSC-Conformer) for streaming End-to-End (E2E) ASR. We first show that using sequentially sampled chunk-wise multi-head self-attention (SSC-MHSA) in the Conformer encoder yields significant performance gains, because it allows efficient cross-chunk interactions while keeping linear complexity. We then explore chunked convolution, which exploits the chunk-wise future context, and integrate it with causal convolution in the convolution layers to further reduce the character error rate (CER). We evaluate the SSC-Conformer on the AISHELL-1 benchmark, where it achieves state-of-the-art performance for streaming E2E ASR, with a CER of 5.33% without LM rescoring. Owing to its linear complexity, the SSC-Conformer can also be trained with large batch sizes and performs inference more efficiently.
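To illustrate why restricting attention to chunks gives linear cost in the sequence length, the following is a minimal sketch of plain chunk-wise self-attention. It is not the SSC-MHSA of this paper: the sequential sampling that provides cross-chunk interaction is not specified in the abstract and is not reproduced here. The function names `chunk_self_attention` and `score_count` are hypothetical, and queries, keys, and values are taken to be the input frames themselves, with no learned projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def chunk_self_attention(frames, chunk_size):
    """Single-head self-attention restricted to non-overlapping chunks.

    frames: list of T feature vectors (lists of floats), all of length dim.
    Assumption for illustration: Q = K = V = frames (no learned projections).
    """
    dim = len(frames[0])
    out = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]  # last chunk may be shorter
        for q in chunk:
            # Scaled dot-product scores against keys in the same chunk only.
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                      for k in chunk]
            weights = softmax(scores)
            # Weighted sum of the values in the same chunk.
            out.append([sum(w * v[j] for w, v in zip(weights, chunk))
                        for j in range(dim)])
    return out


def score_count(num_frames, chunk_size):
    """Number of query-key scores computed by chunk_self_attention."""
    full, rem = divmod(num_frames, chunk_size)
    return full * chunk_size ** 2 + rem ** 2
```

With a fixed chunk size, doubling the number of frames doubles the number of attention scores, whereas full self-attention would quadruple it. The price of this plain variant is that frames in different chunks never interact, which is the limitation that the cross-chunk mechanism described above is meant to address.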