We introduce Amortized Neural Networks (AmNets), a compute-cost- and latency-aware network architecture particularly well-suited to sequence modeling tasks. We apply AmNets to the Recurrent Neural Network Transducer (RNN-T) to reduce compute cost and latency for an automatic speech recognition (ASR) task. The AmNets RNN-T architecture enables the network to dynamically switch between encoder branches on a frame-by-frame basis. Branches are constructed with varying levels of compute cost and model capacity. We realize the variable-compute branches with two well-known techniques: one using sparse pruning and the other using matrix factorization. Frame-by-frame switching is determined by an arbitrator network that adds negligible compute overhead. We present results for both variants on LibriSpeech data and show that the proposed architecture can reduce inference cost by up to 45\% and bring latency to nearly real-time without incurring a loss in accuracy.
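To make the branching idea concrete, the following is a minimal numpy sketch of per-frame branch selection, not the authors' implementation: a tiny arbitrator (here a single hypothetical linear projection with a threshold) picks, for each frame, between a full-capacity dense branch and a cheaper rank-`r` factorized branch. All dimensions, weights, and the arbitrator rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 64, 8, 10  # model dim, low rank, number of frames (all assumed)

# Full-capacity branch: dense d x d weight, O(d^2) multiplies per frame.
W_full = rng.standard_normal((d, d)) / np.sqrt(d)

# Cheap branch: rank-r factorization U @ V, O(d*r) multiplies per frame.
U = rng.standard_normal((d, r)) / np.sqrt(d)
V = rng.standard_normal((r, d)) / np.sqrt(r)

# Arbitrator: one linear projection + threshold -- negligible overhead
# compared with either branch. (Illustrative stand-in for a learned gate.)
w_arb = rng.standard_normal(d) / np.sqrt(d)

def encode(frames):
    """Apply one of two encoder branches per frame, chosen by the arbitrator."""
    outputs = []
    for x in frames:                  # one branch decision per frame
        if (w_arb @ x) > 0.0:
            y = (x @ U) @ V           # cheap factorized branch
        else:
            y = x @ W_full            # full-capacity branch
        outputs.append(y)
    return np.stack(outputs)

frames = rng.standard_normal((T, d))
out = encode(frames)
print(out.shape)  # (10, 64)
```

In a trained system the hard threshold would be replaced by a learned, differentiable gating mechanism, and the cheap branch could equally be a sparsely pruned copy of the full weight matrix, matching the paper's second variant.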