Conformer models maintain a large number of internal states, the vast majority of which are associated with self-attention layers. With limited memory bandwidth, reading these states from memory at each inference step can slow down inference. In this paper, we design an optimized Conformer that is small enough to meet on-device restrictions and runs fast on TPUs. We explore several ideas to improve execution speed, including replacing the lower Conformer blocks with convolution-only blocks, strategically downsizing the architecture, and utilizing an RNNAttention-Performer. Our optimized Conformer can be readily incorporated into a cascaded-encoder setting, allowing a second-pass decoder to operate on its output and improve accuracy whenever more resources are available. Altogether, we find that these optimizations reduce latency by a factor of 6.8 at a reasonable trade-off in quality. With the cascaded second pass, we show that the recognition accuracy is fully recoverable. Thus, our proposed encoder can double both as a strong standalone on-device encoder and as the first stage of a high-performance ASR pipeline.
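To make the first of these ideas concrete, below is a minimal PyTorch sketch of replacing the lower Conformer blocks with convolution-only blocks: the bottom of the stack keeps only the Conformer convolution module, so those layers hold no attention state that must be read back at every inference step. All class names, block counts, and hyperparameters here (ConvModule, HybridEncoder, num_conv, and so on) are illustrative assumptions, not the paper's implementation.

# Illustrative sketch of an encoder whose lower blocks are convolution-only
# while the upper blocks keep self-attention. Names and sizes are made up
# for exposition; this is not the paper's code.
import torch
import torch.nn as nn


class ConvModule(nn.Module):
    """Conformer-style convolution module: pointwise -> GLU -> depthwise -> Swish -> pointwise."""

    def __init__(self, dim: int, kernel_size: int = 15):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)  # doubled channels for GLU
        self.depthwise = nn.Conv1d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        self.activation = nn.SiLU()  # Swish
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)                  # -> (batch, dim, time)
        y = nn.functional.glu(self.pointwise_in(y), dim=1)
        y = self.activation(self.depthwise(y))
        y = self.pointwise_out(y).transpose(1, 2)         # -> (batch, time, dim)
        return x + y                                      # residual connection


class AttentionBlock(nn.Module):
    """Stand-in for a full Conformer block: self-attention followed by the conv module."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv = ConvModule(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x)
        attn_out, _ = self.attn(y, y, y, need_weights=False)
        return self.conv(x + attn_out)


class HybridEncoder(nn.Module):
    """Lower `num_conv` blocks are convolution-only; the remaining blocks keep attention."""

    def __init__(self, dim: int = 256, num_conv: int = 4, num_attn: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ConvModule(dim) for _ in range(num_conv)]
            + [AttentionBlock(dim) for _ in range(num_attn)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x


if __name__ == "__main__":
    encoder = HybridEncoder()
    frames = torch.randn(2, 100, 256)  # (batch, time, feature_dim)
    print(encoder(frames).shape)       # torch.Size([2, 100, 256])

The design point, following the abstract's memory-bandwidth argument, is that attention states grow with the context length and must be re-read at every step, whereas a depthwise convolution only needs a fixed, kernel-sized window of past frames, so pushing the bottom of the stack to convolution-only blocks cuts per-step memory traffic.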