Time-Frequency Localization Using Deep Convolutional Maxout Neural Network in Persian Speech Recognition

This paper proposes a CNN-based structure for time-frequency localization of information in the acoustic model of an automatic speech recognition (ASR) system for Persian. Research has shown that the spectrotemporal plasticity of the receptive fields of some neurons in the mammalian primary auditory cortex and midbrain provides localization capabilities that improve recognition performance. Because biological systems have inspired many man-made systems through their high efficiency and performance, much work in the last few years has sought to localize time-frequency information in ASR systems, exploiting the spatial or temporal invariance properties of methods such as TDNN, CNN, and LSTM-RNN. However, most of these models have a large number of parameters and are challenging to train. We present a structure called the Time-Frequency Convolutional Maxout Neural Network (TFCMNN), in which two parallel 1D-CMNN blocks are used, one in the time domain and one in the frequency domain. The two blocks are applied simultaneously but independently to the spectrogram; their outputs are then concatenated and fed jointly to a fully connected maxout network for classification. To improve the performance of this structure, we use recently developed methods and models such as dropout, maxout, and weight normalization. Two sets of experiments were designed and carried out on the Persian FARSDAT speech dataset to evaluate this model against conventional 1D-CMNN models. According to the experimental results, the average recognition score of the TFCMNN models is about 1.6% higher than that of the conventional models. In addition, the average training time of the TFCMNN models is about 17 hours shorter than that of the conventional models. These results agree with findings reported elsewhere: time-frequency localization in ASR systems increases accuracy and speeds up training.
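
The two-branch data flow described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the kernel lengths, layer sizes, random weights, and all function names (`conv_maxout_1d`, `dense_maxout`, `tfcmnn_forward`) are hypothetical, and dropout, weight normalization, pooling, multi-layer stacking, and the final softmax are omitted. It only shows one plausible reading of "two parallel 1D convolutional maxout blocks, concatenated, then a fully connected maxout layer".

```python
import random

def maxout(pieces):
    """Maxout activation: element-wise max over k candidate feature vectors."""
    return [max(vals) for vals in zip(*pieces)]

def conv1d_valid(seq, kernel):
    """'Valid' 1D correlation of a sequence with a kernel (no padding)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def conv_maxout_1d(spectrogram, kernels, axis):
    """Filter a [time][freq] spectrogram along one axis with k kernels, then maxout.

    axis="time": each frequency bin is filtered along the time axis.
    axis="freq": each time frame is filtered along the frequency axis.
    Returns a flat feature vector.
    """
    if axis == "time":
        rows = [list(col) for col in zip(*spectrogram)]  # one row per frequency bin
    else:
        rows = [list(frame) for frame in spectrogram]    # one row per time frame
    features = []
    for row in rows:
        pieces = [conv1d_valid(row, kern) for kern in kernels]
        features.extend(maxout(pieces))
    return features

def dense_maxout(x, weights, biases):
    """Fully connected maxout layer; weights[k][out][in], biases[k][out]."""
    pieces = [
        [sum(w * xi for w, xi in zip(w_row, x)) + b for w_row, b in zip(W, bvec)]
        for W, bvec in zip(weights, biases)
    ]
    return maxout(pieces)

def tfcmnn_forward(spectrogram, time_kernels, freq_kernels, weights, biases):
    """Run both branches independently on the same input, then classify jointly."""
    t_feat = conv_maxout_1d(spectrogram, time_kernels, axis="time")
    f_feat = conv_maxout_1d(spectrogram, freq_kernels, axis="freq")
    return dense_maxout(t_feat + f_feat, weights, biases)  # list concatenation

# Toy example with assumed sizes: 6 frames x 4 frequency bins, 5 output units.
random.seed(0)
n_frames, n_bins, n_out, k = 6, 4, 5, 2
spectrogram = [[random.random() for _ in range(n_bins)] for _ in range(n_frames)]
time_kernels = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(k)]
freq_kernels = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(k)]
n_in = n_bins * (n_frames - 3 + 1) + n_frames * (n_bins - 2 + 1)  # 16 + 18 = 34
weights = [[[random.uniform(-0.1, 0.1) for _ in range(n_in)]
            for _ in range(n_out)] for _ in range(k)]
biases = [[0.0] * n_out for _ in range(k)]

scores = tfcmnn_forward(spectrogram, time_kernels, freq_kernels, weights, biases)
print(len(scores))  # 5
```

Each branch sees only one axis of the spectrogram, so its kernels are short 1D filters rather than 2D patches. In this toy setup that keeps the per-branch parameter count small, which is in the spirit of the parameter-size concern raised above, though the sketch makes no claim about the sizes used in the actual experiments.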
