Shifted Chunk Encoder for Transformer Based Streaming End-to-End ASR

Currently, there are three main kinds of Transformer-encoder-based streaming End-to-End (E2E) Automatic Speech Recognition (ASR) approaches: time-restricted methods, chunk-wise methods, and memory-based methods. Each has limitations with respect to linear computational complexity, global context modeling, or parallel training. In this work, we aim to build a streaming Transformer ASR model that offers all three of these advantages. In particular, we propose a shifted chunk mechanism for the chunk-wise Transformer that provides cross-chunk connections. As a result, the global context modeling ability of chunk-wise models is significantly enhanced while all of their original merits are retained. We integrate this scheme with the chunk-wise Transformer and Conformer, and refer to the resulting models as SChunk-Transformer and SChunk-Conformer, respectively. Experiments on AISHELL-1 show that SChunk-Transformer and SChunk-Conformer achieve CERs of 6.43% and 5.77%, respectively. Their linear complexity makes it possible to train with large batches and to run inference more efficiently. Our models significantly outperform their conventional chunk-wise counterparts while remaining competitive with U2, which has quadratic complexity, at a gap of only 0.22 absolute CER. They also achieve a better CER than existing chunk-wise or memory-based methods such as HS-DACS and MMA. Code is released.
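The abstract does not spell out how the shifted chunks are formed, so the following is only a minimal sketch of the general idea, not the released implementation. It assumes that frames attend only within their own chunk, and that alternate layers offset the chunk boundaries by half a chunk (the half-chunk offset, the function names, and the toy sizes are all illustrative assumptions). The sketch ignores streaming latency and causality constraints; it only shows how shifted boundaries let information cross chunks while each frame still attends to at most `chunk_size` frames per layer.

```python
def chunk_ids(num_frames, chunk_size, shift=0):
    """Assign each frame to a chunk; `shift` offsets the chunk boundaries."""
    return [(t + shift) // chunk_size for t in range(num_frames)]


def attention_mask(num_frames, chunk_size, shift=0):
    """mask[i][j] is True when frame i may attend to frame j (same chunk)."""
    ids = chunk_ids(num_frames, chunk_size, shift)
    return [[ids[i] == ids[j] for j in range(num_frames)]
            for i in range(num_frames)]


def reachable(masks):
    """Compose per-layer masks: reach[i][j] is True when input frame j
    can influence output frame i after all layers have been applied."""
    n = len(masks[0])
    reach = [[i == j for j in range(n)] for i in range(n)]
    for mask in masks:
        reach = [
            [any(mask[i][k] and reach[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)
        ]
    return reach


# Toy setting (assumed values): 8 frames, chunks of 4, half-chunk shift.
plain = attention_mask(8, 4)             # chunks {0..3}, {4..7}
shifted = attention_mask(8, 4, shift=2)  # chunks {0,1}, {2..5}, {6,7}

# Two plain chunk-wise layers: frame 3 never sees beyond its own chunk.
print(sum(reachable([plain, plain])[3]))    # 4
# Plain layer followed by a shifted layer: frame 3 now sees all 8 frames.
print(sum(reachable([plain, shifted])[3]))  # 8
```

In this toy setting each row of either mask has at most `chunk_size` true entries, so the attention cost per layer grows linearly with the number of frames, while stacking plain and shifted layers widens the receptive field across chunk boundaries.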
