The transformer is a widely used building block in modern neural networks. However, when applied to audio data, the transformer's acausal behaviour, which we term Acausal Attention (AA), has generally limited its application to offline tasks. In this paper we introduce Streaming Attention (SA), which operates causally with fixed latency and requires less compute and memory to train than AA. Next, we introduce Low Latency Streaming Attention (LLSA), a method which combines multiple SA layers without the latency build-up that would otherwise grow in proportion to the layer count. Comparative analyses of AA, SA and LLSA on Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER) tasks are presented. The results show that causal SA-based networks with fixed latencies of a few seconds (e.g. 1.8 seconds), and LLSA networks with latencies as short as 300 ms, can perform comparably with acausal (AA) networks. We conclude that SA and LLSA retain many of the benefits of conventional acausal transformers, but with latency characteristics that make them practical to run in real-time streaming applications.
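The abstract does not spell out the SA formulation; as a minimal sketch of the underlying idea, the NumPy example below implements single-head self-attention under a causal mask with a fixed lookahead window, which is the standard mechanism for bounding a transformer layer's latency. The function names, the single-head setup, and the omission of learned projections are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def streaming_attention_mask(n_frames: int, lookahead: int) -> np.ndarray:
    """Boolean mask: query frame t may attend to key frames <= t + lookahead.

    lookahead = 0 gives strictly causal attention; a fixed positive
    lookahead bounds the layer's latency regardless of sequence length.
    """
    q = np.arange(n_frames)[:, None]   # query positions
    k = np.arange(n_frames)[None, :]   # key positions
    return k <= q + lookahead

def masked_attention(x: np.ndarray, lookahead: int) -> np.ndarray:
    """Scaled dot-product self-attention under a streaming mask.

    x is (n_frames, d_model) and serves as queries, keys and values;
    a real layer would apply learned projections to each.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(streaming_attention_mask(n, lookahead), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = np.random.randn(10, 4)            # 10 frames, 4-dim features
y = masked_attention(x, lookahead=2)  # output at frame t uses frames <= t+2
```

Note that stacking several such layers naively multiplies the effective lookahead by the layer count, which is precisely the latency build-up that LLSA is designed to avoid.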