RNN-T models have gained popularity in the literature and in commercial systems thanks to their competitive accuracy and their ability to operate in online streaming mode. In this work, we conduct an extensive study comparing several prediction network architectures for both monotonic and original RNN-T models. We compare four types of prediction networks based on a common state-of-the-art Conformer encoder and report results obtained on LibriSpeech and an internal medical conversation data set. Our study covers both offline batch-mode and online streaming scenarios. In contrast to some previous works, our results show that Transformer does not always outperform LSTM when used as the prediction network along with a Conformer encoder. Inspired by these results, we propose a new, simple prediction network architecture, N-Concat, which outperforms the others in our online streaming benchmark. The Transformer and n-gram-reduced architectures perform very similarly, yet with some important differences in behaviour with respect to previous context. Overall, we obtain up to 4.1% relative WER improvement over our LSTM baseline, while reducing the prediction network parameters by nearly an order of magnitude (8.4 times).
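The abstract does not spell out how N-Concat is built, but its name and the large parameter reduction suggest a stateless prediction network that concatenates the embeddings of the N most recently emitted labels. The following is a minimal sketch under that assumption; the class name, dimensions, and the use of the blank symbol as left-padding are all illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NConcatPredictionNetwork(nn.Module):
    """Hypothetical N-Concat prediction network: embed the N most recent
    non-blank labels, concatenate the embeddings, and project them to the
    joiner dimension. Being stateless, it needs far fewer parameters than
    an LSTM prediction network."""

    def __init__(self, vocab_size: int, embed_dim: int = 64,
                 context_size: int = 2, output_dim: int = 512,
                 blank_id: int = 0):
        super().__init__()
        self.context_size = context_size
        # blank_id doubles as padding for positions before the first label
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=blank_id)
        self.proj = nn.Linear(context_size * embed_dim, output_dim)

    def forward(self, labels: torch.Tensor) -> torch.Tensor:
        # labels: (batch, U) previously emitted non-blank labels
        # left-pad so every position sees exactly `context_size` tokens
        padded = F.pad(labels, (self.context_size - 1, 0))
        # sliding windows over the last N labels: (batch, U, N)
        windows = padded.unfold(dimension=1, size=self.context_size, step=1)
        emb = self.embedding(windows)        # (batch, U, N, embed_dim)
        emb = emb.flatten(start_dim=2)       # (batch, U, N * embed_dim)
        return self.proj(emb)                # (batch, U, output_dim)

# Usage: a batch of 4 label sequences of length 10 over a 1000-token vocabulary
net = NConcatPredictionNetwork(vocab_size=1000)
out = net(torch.randint(1, 1000, (4, 10)))   # -> torch.Size([4, 10, 512])
```

Because each output depends only on a fixed window of past labels, such a network is naturally suited to the online streaming scenario the abstract highlights.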