Recent years have witnessed significant improvements in the ability of ASR systems to recognize spoken utterances. However, recognition remains challenging for noisy and out-of-domain data, where substitution and deletion errors are prevalent in the transcribed text. These errors significantly degrade the performance of downstream tasks. In this work, we propose a BERT-style language model, referred to as PhonemeBERT, that learns a joint language model over phoneme sequences and ASR transcripts to produce phonetic-aware representations that are robust to ASR errors. We show that PhonemeBERT can be used on downstream tasks with phoneme sequences as additional features, as well as in a low-resource setup where only ASR transcripts are available for the downstream tasks and no phoneme information is present. We evaluate our approach extensively by generating noisy versions of three benchmark datasets: Stanford Sentiment Treebank, TREC, and ATIS, for sentiment, question, and intent classification tasks, respectively. The proposed approach comprehensively beats the state-of-the-art baselines on each dataset.