This paper presents a novel deep learning architecture for acoustic modeling in the context of Automatic Speech Recognition (ASR), termed MixNet. Besides the conventional layers, such as the fully connected layers in DNN-HMM and the memory cells in LSTM-HMM, the model uses two additional layers based on a Mixture of Experts (MoE). The first MoE layer, operating at the input, is based on pre-defined broad phonetic classes, and the second, operating at the penultimate layer, is based on automatically learned acoustic classes. In natural speech, overlap in the distributions of different acoustic classes is inevitable, which leads to inter-class mis-classification. ASR accuracy is expected to improve if the conventional acoustic-model architecture is modified to better account for such overlaps; MixNet is developed with this in mind. Analysis by means of scatter diagrams verifies that the MoE layers indeed improve the separation between classes, which translates into better ASR accuracy. Experiments are conducted on a large-vocabulary ASR task and show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rate compared to the conventional DNN and LSTM models, respectively, trained using the sMBR criterion. In comparison with an existing method developed for phone classification (Eigen et al.), the proposed method yields a significant improvement.
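To make the MoE idea concrete, the following is a minimal sketch of a Mixture-of-Experts layer of the kind described above: a gating network assigns each frame soft weights over a small set of expert sub-networks, and the layer output is the weighted sum of the expert outputs. The use of PyTorch, the class name, the layer sizes, and the single-linear experts are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    """Sketch of an MoE layer: gated combination of expert sub-networks."""

    def __init__(self, in_dim: int, out_dim: int, num_experts: int):
        super().__init__()
        # One feed-forward expert per class (e.g. a broad phonetic class at the
        # input, or a learned acoustic class at the penultimate layer).
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)]
        )
        # Gating network: soft assignment of each frame to the experts.
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim) acoustic features or penultimate-layer activations.
        weights = torch.softmax(self.gate(x), dim=-1)                   # (batch, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, out_dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)          # (batch, out_dim)


# Example: an input-side MoE layer whose experts correspond to five broad
# phonetic classes (e.g. vowels, stops, fricatives, nasals, silence),
# applied to 40-dimensional filterbank frames. Sizes are hypothetical.
layer = MoELayer(in_dim=40, out_dim=512, num_experts=5)
out = layer(torch.randn(8, 40))
print(out.shape)  # torch.Size([8, 512])
```

In a full acoustic model along the lines of the abstract, one such layer would sit at the input (experts tied to pre-defined broad phonetic classes) and another at the penultimate layer (experts tied to automatically learned acoustic classes), with the conventional DNN or LSTM stack in between.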