End-to-end Automatic Speech Recognition (ASR) models are usually trained to optimize a loss over the whole token sequence, neglecting explicit supervision at the phoneme granularity. This can lead to recognition errors caused by similar-phoneme confusion or phoneme reduction. To alleviate this problem, we propose a novel framework based on Supervised Contrastive Learning (SCaLa) that enhances phonemic representation learning for end-to-end ASR systems. Specifically, we extend self-supervised Masked Contrastive Predictive Coding (MCPC) to a fully supervised setting, where supervision is applied as follows. SCaLa first masks variable-length encoder features according to phoneme boundaries, given phoneme forced-alignments extracted from a pre-trained acoustic model; it then predicts the masked features via contrastive learning. The forced alignments provide phoneme labels that mitigate the noise introduced by positive-negative pairs in self-supervised MCPC. Experiments on reading and spontaneous speech datasets show that the proposed approach achieves absolute Character Error Rate (CER) reductions of 2.8 and 1.4 points over the baseline, respectively.
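To make the masking-and-contrast step concrete, the following is a minimal PyTorch sketch of the idea described above. It is not the authors' implementation: the function names, the zero-vector stand-in for a learned mask embedding, the single-utterance batching, and the temperature value are all illustrative assumptions. The key point it demonstrates is how frame-level phoneme labels from forced alignment let same-phoneme frames be excluded from the negative set, avoiding the false negatives that arise in purely self-supervised MCPC.

```python
# Illustrative sketch of phoneme-masked supervised contrastive learning.
# All names here are hypothetical, not taken from the paper's code.
import torch
import torch.nn.functional as F

def phoneme_segments(alignment):
    """Split a per-frame phoneme alignment (1-D LongTensor) into
    (start, end, phoneme_id) tuples, one per contiguous run."""
    segs, start = [], 0
    for t in range(1, len(alignment) + 1):
        if t == len(alignment) or alignment[t] != alignment[start]:
            segs.append((start, t, int(alignment[start])))
            start = t
    return segs

def scala_contrastive_loss(features, alignment, mask_prob=0.3, temperature=0.1):
    """features: (T, D) encoder outputs for one utterance.
    alignment: (T,) frame-level phoneme ids from forced alignment.
    Masks whole phoneme-length spans and contrasts each masked frame
    against all frames, using phoneme labels so that frames sharing the
    query's phoneme are never counted as negatives."""
    T, _ = features.shape
    masked = torch.zeros(T, dtype=torch.bool)
    for s, e, _pid in phoneme_segments(alignment):
        if torch.rand(1).item() < mask_prob:
            masked[s:e] = True          # mask the whole phoneme span
    if not masked.any():
        return features.sum() * 0.0     # no span masked: zero loss, keeps graph
    corrupted = features.clone()
    corrupted[masked] = 0.0             # stand-in for a learned mask embedding
    # The real model would re-encode `corrupted` with a context network;
    # the sketch reuses it directly to stay short.
    q = F.normalize(corrupted[masked], dim=-1)   # queries   (M, D)
    k = F.normalize(features, dim=-1)            # targets   (T, D)
    logits = q @ k.t() / temperature             # similarity (M, T)
    target = torch.nonzero(masked).squeeze(1)    # true frame index per query
    # Supervision from forced alignment: same-phoneme frames (other than
    # the true target) are removed from the softmax denominator.
    same_phone = alignment[masked].unsqueeze(1) == alignment.unsqueeze(0)
    same_phone[torch.arange(len(target)), target] = False
    logits = logits.masked_fill(same_phone, float('-inf'))
    return F.cross_entropy(logits, target)

# Toy usage: 12 phoneme runs of 10 frames each, 256-dim encoder features.
feats = torch.randn(120, 256, requires_grad=True)
align = torch.randint(0, 50, (12,)).repeat_interleave(10)
loss = scala_contrastive_loss(feats, align)
loss.backward()
```

The label-aware `masked_fill` is the supervised twist: in self-supervised MCPC, any non-target frame can serve as a negative, so frames realizing the same phoneme elsewhere in the utterance become noisy negatives; here the alignment labels remove them.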