Sequence-to-sequence attention-based models, which integrate the acoustic, pronunciation, and language models into a single neural network, have recently shown very promising results on automatic speech recognition (ASR) tasks. Among these models, the Transformer, a sequence-to-sequence attention-based model that relies entirely on self-attention without using RNNs or convolutions, has achieved a new single-model state-of-the-art BLEU score on neural machine translation (NMT) tasks. Motivated by the outstanding performance of the Transformer, we extend it to speech and adopt it as the basic architecture of our sequence-to-sequence attention-based models for Mandarin Chinese ASR tasks. Furthermore, we compare syllable based and context-independent phoneme (CI-phoneme) based models with the Transformer in Mandarin Chinese. Additionally, a greedy cascading decoder with the Transformer is proposed to map CI-phoneme sequences and syllable sequences into word sequences. Experiments on the HKUST dataset demonstrate that the syllable based model with the Transformer outperforms its CI-phoneme based counterpart, achieving a character error rate (CER) of \emph{$28.77\%$}, which is competitive with the state-of-the-art CER of $28.0\%$ obtained by the joint CTC-attention based encoder-decoder network.
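For context, the core operation of the Transformer is scaled dot-product self-attention (Vaswani et al., 2017), in which the queries $Q$, keys $K$, and values $V$ are all projections of the same input sequence:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\]
where $d_k$ is the dimension of the keys and the $\sqrt{d_k}$ scaling keeps the logits of the softmax well conditioned.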
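For reference, the character error rate (CER) reported above is the character-level edit (Levenshtein) distance between the hypothesis and the reference transcript, normalized by the reference length. The following is a minimal Python sketch of this metric; the function names are illustrative and are not taken from the paper's code:
\begin{verbatim}
def edit_distance(ref, hyp):
    # Single-row dynamic-programming Levenshtein distance over characters.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    # CER = edit distance / number of reference characters.
    return edit_distance(ref, hyp) / len(ref)

# Example: one substituted character out of four -> CER of 0.25.
print(cer("今天天气", "今天天汽"))
\end{verbatim}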