Automatic Speech Recognition (ASR) is an imperfect process that produces certain mismatches between ASR output text and plain written text or transcriptions. When plain text data is used to train systems for spoken language understanding or ASR, a proven strategy to reduce this mismatch and prevent degradation is to hallucinate what the ASR output would be given a gold transcription. Prior work in this domain has focused on modeling errors at the phonetic level, using a lexicon to convert the phones to words, usually accompanied by an FST language model. We present novel end-to-end models that directly predict hallucinated ASR word sequence outputs, conditioning on an input word sequence as well as a corresponding phoneme sequence. This improves on previously published results for recall of errors in an in-domain ASR system's transcriptions of unseen data, as well as an out-of-domain ASR system's transcriptions of audio from an unrelated task, while additionally exploring an in-between scenario in which limited characterization data from the test ASR system is obtainable. To verify the extrinsic validity of the method, we also use our hallucinated ASR errors to augment training for a spoken question classifier, finding that they enable robustness to real ASR errors in a downstream task when little or even no task-specific audio is available at training time.
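To make the conditioning setup concrete, the sketch below shows one way an end-to-end error hallucinator could be structured: two encoders consume the gold word sequence and its phoneme sequence, and a decoder attends over both to predict the word sequence an ASR system might output. This is an illustrative assumption, not the paper's exact architecture; all module choices, dimensions, and names (e.g. `ErrorHallucinator`) are hypothetical.

```python
# Minimal sketch of an error-hallucination seq2seq model (assumed architecture,
# not the authors' implementation): dual encoders over words and phonemes,
# single attention-augmented decoder over hallucinated ASR output words.
import torch
import torch.nn as nn

class ErrorHallucinator(nn.Module):
    def __init__(self, word_vocab, phone_vocab, dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, dim)
        self.phone_emb = nn.Embedding(phone_vocab, dim)
        self.word_enc = nn.GRU(dim, dim, batch_first=True)
        self.phone_enc = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, word_vocab)

    def forward(self, words, phones, targets):
        # Encode the reference word sequence and the corresponding phonemes.
        w_states, _ = self.word_enc(self.word_emb(words))      # (B, Tw, D)
        p_states, _ = self.phone_enc(self.phone_emb(phones))   # (B, Tp, D)
        memory = torch.cat([w_states, p_states], dim=1)        # joint memory

        # Teacher-forced decoding of the hallucinated ASR output words.
        d_states, _ = self.decoder(self.word_emb(targets))
        ctx, _ = self.attn(d_states, memory, memory)
        return self.out(d_states + ctx)                        # (B, Tt, V)

# Toy usage with made-up vocabulary sizes and random token ids.
model = ErrorHallucinator(word_vocab=1000, phone_vocab=60)
words = torch.randint(0, 1000, (2, 7))
phones = torch.randint(0, 60, (2, 20))
targets = torch.randint(0, 1000, (2, 8))
logits = model(words, phones, targets)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets)
```

In practice the hallucinated outputs sampled from such a model would be used to corrupt clean training text, e.g. for augmenting a downstream spoken question classifier as described above.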