In recent years, end-to-end speech recognition based on deep learning has developed rapidly, but Turkish speech recognition systems perform poorly because Turkish speech data is scarce. First, this paper studies a series of tuning techniques for speech recognition. The results show that the model performs best when data augmentation combines speed perturbation with noise addition and the beam search width is set to 16. Second, to make fuller use of effective feature information and improve the accuracy of feature extraction, this paper proposes a new feature extractor, LSPC. LSPC is combined with a LiGRU network to form a shared encoder structure, which also achieves model compression. The results show that, when only Fbank features are used, LSPC outperforms MSPC and VGGnet, lowering the word error rate (WER) by 1.01% and 2.53%, respectively. Finally, building on these two contributions, a new multi-feature fusion network is proposed as the main structure of the encoder. The results show that the proposed LSPC-based feature fusion network lowers the WER by a further 0.82% and 1.94% compared with LSPC applied to a single feature (Fbank and spectrogram, respectively). Our model achieves performance comparable to that of advanced end-to-end models.
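As a rough illustration of the augmentation combination mentioned in the abstract, the following sketch applies speed perturbation followed by additive noise to a raw waveform. It is not the paper's implementation: the function names (`speed_perturb`, `add_noise`, `augment`), the speed factors 0.9/1.0/1.1, the 15 dB signal-to-noise ratio, and the use of white Gaussian noise are all assumptions chosen for the example, since the abstract does not state these settings.

```python
import math
import random


def speed_perturb(samples, factor):
    """Resample by linear interpolation so the audio plays `factor` times faster.

    Hypothetical helper: factor > 1 shortens the signal, factor < 1 lengthens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    out_len = max(1, int(round(len(samples) / factor)))
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = min(i * factor, last)      # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise scaled to the requested SNR in dB (assumed noise type)."""
    if not samples:
        return []
    signal_power = sum(x * x for x in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in samples]


def augment(samples, rng, factors=(0.9, 1.0, 1.1), snr_db=15.0):
    """Pick a random speed factor, then add noise to the perturbed signal."""
    factor = rng.choice(factors)         # assumed factor set, not from the paper
    return add_noise(speed_perturb(samples, factor), snr_db, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # One second of a 440 Hz tone at an assumed 16 kHz sampling rate.
    tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    augmented = augment(tone, rng)
    print(len(tone), len(augmented))
```

In a training pipeline, a function like `augment` would typically be applied on the fly to each utterance before feature extraction, so that every epoch sees a slightly different version of the same recording.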