Time-Frequency Localization Using Deep Convolutional Maxout Neural Network in Persian Speech Recognition

In this paper, we propose a CNN-based structure for time-frequency localization of audio signal information in the acoustic model of an automatic speech recognition (ASR) system for Persian. Research has shown that the time-frequency flexibility of receptive fields in the auditory neurons of some mammals improves recognition performance. Biological systems have inspired many artificial systems because of their high efficiency and performance, so time-frequency localization has been used extensively to improve system performance. In the last few years, much work has been done on localizing time-frequency information in ASR systems, relying on the spatial invariance properties of methods such as TDNN, CNN and LSTM-RNN. However, most of these models have a large number of parameters and are challenging to train. In the structure we have designed, called the Time-Frequency Convolutional Maxout Neural Network (TFCMNN), two parallel blocks of 1D-CMNN, each with weight sharing in one dimension, are applied simultaneously but independently to the feature vectors. Their outputs are then concatenated and passed to a fully connected maxout network for classification. To improve the performance of this structure, we use recently developed methods such as maxout, dropout, and weight normalization. Two sets of experiments were designed and run on the Persian FARSDAT speech data set to compare this model with conventional 1D-CMNN models. According to the experimental results, the average recognition score of the TFCMNN models is about 1.6% higher than that of the conventional models. In addition, the average training time of the TFCMNN models is about 17 hours shorter than that of the conventional models. These results agree with earlier reports that time-frequency localization in ASR systems increases accuracy and speeds up model training.
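The two-branch structure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it reads "weight sharing in one dimension" as a 1D convolution that slides along time in one block and along frequency in the other. All names (`maxout`, `conv1d_maxout`, `tfcmnn_forward`, `init_params`) and all sizes (11 frames, 40 bands, filter width 3, 30 classes) are hypothetical. Dropout, weight normalization, and training are left out.

```python
import numpy as np


def maxout(z, k):
    """Maxout: take the max over groups of k consecutive units (last axis)."""
    *lead, n = z.shape
    assert n % k == 0, "number of units must be divisible by k"
    return z.reshape(*lead, n // k, k).max(axis=-1)


def conv1d_maxout(x, w, b, k):
    """Valid, stride-1 1D convolution along axis 0 of x, followed by maxout.

    x: (length, channels), w: (width, channels, filters), b: (filters,)
    Returns an array of shape (length - width + 1, filters // k).
    """
    width = w.shape[0]
    steps = x.shape[0] - width + 1
    # Sliding windows over axis 0: (steps, width, channels)
    windows = np.stack([x[i:i + width] for i in range(steps)])
    # The same weights w are reused at every step (weight sharing).
    z = np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b
    return maxout(z, k)


def init_params(frames, bands, width, filters, hidden, classes, k, rng):
    """Random parameters for the sketch (sizes are illustrative assumptions)."""
    t_out = (frames - width + 1) * (filters // k)
    f_out = (bands - width + 1) * (filters // k)
    return {
        "w_time": rng.normal(0, 0.1, (width, bands, filters)),
        "b_time": np.zeros(filters),
        "w_freq": rng.normal(0, 0.1, (width, frames, filters)),
        "b_freq": np.zeros(filters),
        "w_fc": rng.normal(0, 0.1, (t_out + f_out, hidden * k)),
        "b_fc": np.zeros(hidden * k),
        "w_out": rng.normal(0, 0.1, (hidden, classes)),
        "b_out": np.zeros(classes),
    }


def tfcmnn_forward(feats, params, k=2):
    """Forward pass for one (frames, bands) block of features."""
    # Time block: filters slide along time; frequency bands act as channels.
    t = conv1d_maxout(feats, params["w_time"], params["b_time"], k)
    # Frequency block: filters slide along frequency; frames act as channels.
    f = conv1d_maxout(feats.T, params["w_freq"], params["b_freq"], k)
    # Concatenate both branches and classify with a fully connected maxout layer.
    h = np.concatenate([t.ravel(), f.ravel()])
    h = maxout(h @ params["w_fc"] + params["b_fc"], k)
    logits = h @ params["w_out"] + params["b_out"]
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()


rng = np.random.default_rng(0)
params = init_params(frames=11, bands=40, width=3, filters=8,
                     hidden=16, classes=30, k=2, rng=rng)
probs = tfcmnn_forward(rng.normal(size=(11, 40)), params)
print(probs.shape, probs.sum())
```

Each branch shares its weights along only one axis, so the two branches see the same input but localize information along time and along frequency separately before the classifier combines them.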
