High-quality labeling of domain-specific data is costly and time-consuming for human annotators. In this work, we propose a self-supervised domain adaptation method based on an iterative pseudo-forced alignment algorithm. The alignments it produces are used to customize an end-to-end Automatic Speech Recognition (ASR) system and are iteratively refined. The algorithm is fed with frame-wise character posteriors produced by a seed ASR model, which is trained on out-of-domain data and optimized with a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively over a corpus of broadcast TV. The process is repeated, either reducing the quantity of text to be aligned or expanding the alignment window, until the best possible audio-text alignment is found. The starting timestamps, or temporal anchors, are derived solely from the confidence score of the last aligned utterance. This score is computed from the paths of the CTC alignment matrix. With this methodology, no human-revised text references are required. Alignments from long audio files with low-quality transcriptions, such as TV captions, are filtered by confidence score and are then ready for further ASR adaptation. The results obtained on both the Spanish RTVE2022 and CommonVoice databases support the feasibility of using CTC-based systems to perform highly accurate audio-text alignment, domain adaptation, and semi-supervised training of end-to-end ASR.
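To make the alignment step concrete, the following is a minimal sketch and not the authors' implementation. It assumes the seed ASR output is available as a list of per-frame log-posteriors, uses standard Viterbi decoding over the blank-interleaved CTC label sequence, and takes the geometric mean of the best path's frame probabilities as the confidence score. That score definition, the names `ctc_forced_align` and `align_utterance`, and the parameters `window`, `step`, and `threshold` are all illustrative assumptions; only the window-expansion idea and the use of the last aligned position as the next temporal anchor come from the abstract.

```python
import math

NEG_INF = float("-inf")


def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi forced alignment of `targets` to frame-wise log posteriors.

    log_probs: list of T frames, each a list of V log-probabilities.
    targets:   list of token ids (no blanks).
    Returns (frame_labels, confidence), or (None, 0.0) if no path fits.
    """
    T = len(log_probs)
    if T == 0 or not targets:
        return None, 0.0
    # Interleave blanks: [blank, t1, blank, t2, ..., blank].
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S = len(ext)

    dp = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    dp[0][0] = log_probs[0][ext[0]]
    dp[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance one, or skip a blank.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: dp[t - 1][p])
            if dp[t - 1][best] > NEG_INF:
                dp[t][s] = dp[t - 1][best] + log_probs[t][ext[s]]
                back[t][s] = best

    # A valid path ends on the trailing blank or on the last token.
    s = max((S - 1, S - 2), key=lambda p: dp[T - 1][p])
    if dp[T - 1][s] == NEG_INF:
        return None, 0.0
    score = dp[T - 1][s]
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    path.reverse()
    # Assumed confidence: geometric mean of the path's frame probabilities.
    return path, math.exp(score / T)


def align_utterance(log_probs, targets, anchor, window, step, threshold, blank=0):
    """Expand the window from `anchor` until the alignment is confident enough.

    Returns (frame_labels, confidence, end_frame); `end_frame` can serve as
    the temporal anchor for the next utterance.
    """
    total = len(log_probs)
    best = (None, 0.0, anchor)
    end = min(anchor + window, total)
    while True:
        path, conf = ctc_forced_align(log_probs[anchor:end], targets, blank)
        if conf > best[1]:
            best = (path, conf, end)
        if conf >= threshold or end == total:
            return best
        end = min(end + step, total)
```

In a pipeline of the kind the abstract describes, utterances whose returned confidence stays below the threshold would be discarded, and the remaining audio-text pairs would be kept for ASR adaptation. The complementary strategy of reducing the amount of text to align is omitted here for brevity.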