Recently, conformer-based end-to-end automatic speech recognition, which outperforms its recurrent neural network based counterparts, has received much attention. Although the conformer parallelizes more efficiently than recurrent neural networks, the computational complexity of its dot-product self-attention is quadratic in the length of the input feature sequence. To address this, we propose multi-head linear self-attention for the self-attention layer, which reduces its computational complexity to linear order. In addition, we propose to factorize the feed-forward module of the conformer by low-rank matrix factorization, which reduces the number of parameters by approximately 50% with little performance loss. The proposed model, named linear attention based conformer (LAC), can be trained and used for inference jointly with the connectionist temporal classification objective, which further improves its performance. To evaluate the effectiveness of LAC, we conduct experiments on the AISHELL-1 and LibriSpeech corpora. Results show that the proposed LAC achieves better performance than 7 recently proposed speech recognition models and is competitive with the state-of-the-art conformer. Meanwhile, LAC has only about 50% as many parameters as the conformer and trains faster.
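The two efficiency ideas above can be illustrated with a minimal sketch. The abstract does not spell out the exact form of the multi-head linear self-attention, so the code below assumes a generic single-head kernelized attention with the feature map elu(x) + 1, which may differ from what LAC actually uses. The function names and the dimensions in the usage note are hypothetical and chosen only for illustration.

```python
import math


def feature_map(x):
    """elu(x) + 1, a positive feature map (an assumption, not taken from the paper)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(Q, K, V):
    """Single-head kernelized attention in O(T * d * d_v) time.

    Q, K are T x d lists of lists and V is T x d_v. The key-value summary
    is accumulated once, so no T x T attention matrix is ever formed.
    """
    d, d_v = len(Q[0]), len(V[0])
    phi_Q = [[feature_map(x) for x in row] for row in Q]
    phi_K = [[feature_map(x) for x in row] for row in K]

    kv = [[0.0] * d_v for _ in range(d)]  # sum over t of phi(k_t) v_t^T
    z = [0.0] * d                         # sum over t of phi(k_t)
    for k_row, v_row in zip(phi_K, V):
        for i, k in enumerate(k_row):
            z[i] += k
            for j, v in enumerate(v_row):
                kv[i][j] += k * v

    out = []
    for q_row in phi_Q:
        norm = sum(q * zi for q, zi in zip(q_row, z))
        out.append([sum(q_row[i] * kv[i][j] for i in range(d)) / norm
                    for j in range(d_v)])
    return out


def quadratic_attention(Q, K, V):
    """Same kernelized attention computed through explicit T x T weights."""
    d_v = len(V[0])
    phi_Q = [[feature_map(x) for x in row] for row in Q]
    phi_K = [[feature_map(x) for x in row] for row in K]
    out = []
    for q_row in phi_Q:
        weights = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in phi_K]
        total = sum(weights)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, V)) / total
                    for j in range(d_v)])
    return out


def ffn_weight_params(d_model, d_ff, rank=None):
    """Weight count (biases ignored) of a two-layer feed-forward module.

    With rank set, each d_in x d_out matrix is replaced by the product of a
    d_in x rank matrix and a rank x d_out matrix.
    """
    if rank is None:
        return 2 * d_model * d_ff
    return 2 * rank * (d_model + d_ff)
```

Both attention functions return the same values up to floating-point error; only the order of the multiplications differs, which is where the linear cost in the sequence length comes from. Calling `ffn_weight_params(256, 2048)` and `ffn_weight_params(256, 2048, rank=64)` with these hypothetical sizes shows how the rank controls the parameter saving in the feed-forward module.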