Effects of Number of Filters of Convolutional Layers on Speech Recognition Model Accuracy

From arXiv: 8 pages, 9 figures, 3 tables. To be published in the Proc. of the 19th IEEE International Conference on Machine Learning and Applications, pp. 971-978, 2020. DOI 10.1109/ICMLA51294.2020.00158. © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising purposes

Inspired by the progress of the End-to-End approach [1], this paper systematically studies how the number of filters in the convolutional layers affects the prediction accuracy of CNN+RNN models (Convolutional Neural Networks combined with Recurrent Neural Networks) for Automatic Speech Recognition (ASR). Experimental results show that adding a CNN to the RNN improves the performance of the CNN+RNN speech recognition model only when the CNN's number of filters exceeds a certain threshold; for some ranges of CNN parameters, adding the CNN to the RNN model brings no benefit. Our results show a strong dependency of word accuracy on the number of filters of the convolutional layers. Based on the experimental results, the paper proposes a possible hypothesis, Sound-2-Vector Embedding (Convolutional Embedding), to explain these observations. Based on this embedding hypothesis and on parameter optimization, the paper develops an End-to-End speech recognition system that combines high word accuracy with a lightweight model. The resulting LVCSR (Large Vocabulary Continuous Speech Recognition) model achieves a word accuracy of 90.2% with its acoustic model alone, without any intermediate phonetic representation or Language Model. Its acoustic model contains only 4.4 million weight parameters, compared with the 35~68 million acoustic-model weight parameters of DeepSpeech2 [2] (one of the top state-of-the-art LVCSR models), which achieves a word accuracy of 91.5%. The lightweight model improves transcription computing efficiency and is also useful for mobile devices, driverless vehicles, etc. Our model is reduced to ~10% of the size of DeepSpeech2, while its accuracy remains close to that of DeepSpeech2. If combined with a Language Model, our LVCSR system is able to achieve 91.5% word accuracy.
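To make the size figures in the abstract concrete, the following minimal Python sketch shows how the weight count of a single 1-D convolutional layer grows with its number of filters, and what fraction of the DeepSpeech2 acoustic-model size 4.4 million parameters represents. The function names, the 161 input features, and the kernel width of 11 are hypothetical values chosen for illustration; they are not taken from the paper. Only the 4.4 million and 35~68 million figures come from the abstract.

```python
def conv1d_param_count(in_channels: int, num_filters: int, kernel_size: int) -> int:
    """Trainable parameters of a standard 1-D convolution: weights plus one bias per filter."""
    return kernel_size * in_channels * num_filters + num_filters


def size_ratio(small_millions: float, large_millions: float) -> float:
    """Fraction of the larger model's parameter count taken up by the smaller model."""
    return small_millions / large_millions


if __name__ == "__main__":
    # Hypothetical front end (assumption, not from the paper):
    # 161 input features per frame, kernel spanning 11 frames.
    for num_filters in (32, 128, 512):
        params = conv1d_param_count(in_channels=161, num_filters=num_filters, kernel_size=11)
        print(f"{num_filters:4d} filters: {params:,} parameters")

    # Figures quoted in the abstract: 4.4M parameters vs. 35M to 68M for DeepSpeech2.
    for deepspeech2_millions in (35.0, 68.0):
        ratio = size_ratio(4.4, deepspeech2_millions)
        print(f"4.4M / {deepspeech2_millions:.0f}M = {ratio:.1%}")
```

The layer's parameter count scales linearly with the number of filters, so the filter count is a direct lever on model size as well as on accuracy. The two ratios bracket the "~10%" figure quoted in the abstract.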
