In this paper, we study training of automatic speech recognition system in a weakly supervised setting where the order of words in transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using LogSumExp operation and uses a cross-entropy loss to match with the ground-truth words distribution. Using the pseudo-labels generated from this model on the training set, we then train a letter-based acoustic model using Connectionist Temporal Classification loss. Our system achieves 2.4%/5.3% on test-clean/test-other subsets of LibriSpeech, which is competitive with the supervised baseline's performance.
翻译:在本文中,我们研究在一个监督不力的环境中自动语音识别系统的培训,那里的音频培训数据记录标签的文字顺序不详。我们培训一个字级声学模型,该模型利用LogSumExpt 操作汇总所有输出框架的分布,并使用跨元素损失与地面真实字分布相匹配。我们使用该模型在培训成套材料上产生的假标签,然后用Commiteist Temoral分类损失来培训一个基于字母的声学模型。我们的系统在LibriSpeech的测试-清洁/测试其他子集上达到2.4%/5.3%,该子集与受监督的基线性能相比具有竞争力。