In our previous work, we proposed a discriminative autoencoder (DcAE) for speech recognition. DcAE combines two training objectives in a single model. First, because DcAE learns encoder-decoder mappings, it minimizes the squared error between the reconstructed speech and the input speech. Second, in the code layer, it obtains frame-based phonetic embeddings by minimizing the categorical cross-entropy between ground-truth labels and predicted triphone-state scores. DcAE is built on the Kaldi toolkit and treats various TDNN models as encoders. In this paper, we propose three new versions of DcAE. The first, chain-based DcAE (c-DcAE), uses a new objective function that considers both the categorical cross-entropy and the mutual information between ground-truth and predicted triphone-state sequences. For robust speech recognition, we further extend c-DcAE to hierarchical and parallel structures, yielding hc-DcAE and pc-DcAE. In these two models, the objective function includes both the error between the reconstructed noisy speech and the input noisy speech and the error between the enhanced speech and the reference clean speech. Experimental results on the WSJ and Aurora-4 corpora show that our DcAE models outperform baseline systems.
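As a rough illustration of how the two original DcAE training terms, and the additional enhancement term used by hc-DcAE and pc-DcAE, could be combined for a single frame, consider the following minimal sketch. It is not the Kaldi implementation: the function names, the weights `recon_weight` and `enh_weight`, and the plain weighted sum are assumptions made for illustration, and the sequence-level mutual-information term of c-DcAE is not covered.

```python
import math


def softmax(logits):
    """Convert raw triphone-state scores into posterior probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def squared_error(estimate, target):
    """Sum of squared differences between two feature vectors."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target))


def cross_entropy(state_logits, state_label):
    """Categorical cross-entropy for one frame with an integer state label."""
    return -math.log(softmax(state_logits)[state_label])


def dcae_frame_loss(frame, reconstruction, state_logits, state_label,
                    recon_weight=1.0):
    """Cross-entropy at the code layer plus weighted reconstruction error.

    recon_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    return (cross_entropy(state_logits, state_label)
            + recon_weight * squared_error(reconstruction, frame))


def robust_frame_loss(noisy_frame, noisy_reconstruction,
                      enhanced_frame, clean_frame,
                      state_logits, state_label,
                      recon_weight=1.0, enh_weight=1.0):
    """Adds the enhanced-vs-clean error described for hc-DcAE and pc-DcAE.

    Both weights are assumptions; the abstract only states that both
    errors enter the objective function.
    """
    return (dcae_frame_loss(noisy_frame, noisy_reconstruction,
                            state_logits, state_label, recon_weight)
            + enh_weight * squared_error(enhanced_frame, clean_frame))
```

With a perfect reconstruction the loss reduces to the cross-entropy term alone, which makes the contribution of each term easy to inspect separately.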