Punctuation and segmentation are key to readability in Automatic Speech Recognition (ASR), yet they are typically evaluated with F1 scores, which require high-quality human transcripts and do not reflect readability well. Human evaluation is expensive, time-consuming, and suffers from large inter-observer variability, especially for conversational speech, which lacks strict grammatical structure. Large pre-trained models capture a notion of grammatical structure. We present TRScore, a novel readability measure that uses the GPT model to evaluate different segmentation and punctuation systems. We validate our approach with human experts. Additionally, our approach enables quantitative assessment of the effect of text post-processing techniques, such as capitalization, inverse text normalization (ITN), and disfluency removal, on overall readability, which traditional word error rate (WER) and slot error rate (SER) metrics fail to capture. TRScore correlates strongly with traditional F1 and human readability scores, with Pearson's correlation coefficients of 0.67 and 0.98, respectively. It also eliminates the need for human transcriptions during model selection.
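The abstract does not spell out how the GPT model is used for scoring. Below is a minimal illustrative sketch of the general idea, assuming a perplexity-style readability proxy from a pre-trained GPT-2 (via Hugging Face transformers) followed by a Pearson-correlation check against human ratings, as the abstract reports; the helper name `readability_proxy`, the "gpt2" checkpoint, and all numbers are hypothetical and not the paper's actual TRScore formulation.

```python
# Illustrative sketch only: a perplexity-style readability proxy from a
# pre-trained GPT model, plus the Pearson validation step. NOT the paper's
# actual TRScore formulation; model choice ("gpt2" via Hugging Face
# transformers) and the helper name `readability_proxy` are assumptions.
import torch
from scipy.stats import pearsonr
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def readability_proxy(text: str) -> float:
    """GPT-2 perplexity of `text`; lower suggests more natural,
    better-punctuated and better-segmented output."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

# Compare two candidate post-processings of the same ASR hypothesis.
raw = "so yeah we met on friday and then uh we decided to ship it"
punct = "So, yeah, we met on Friday and then we decided to ship it."
print(readability_proxy(raw), readability_proxy(punct))

# Validation as in the abstract: correlate per-system model scores with
# human readability ratings (toy numbers here) using Pearson's r.
model_scores = [3.1, 4.5, 2.2, 4.9, 3.8]   # hypothetical scores
human_scores = [3.0, 4.4, 2.5, 5.0, 3.6]   # hypothetical human ratings
r, _ = pearsonr(model_scores, human_scores)
print(f"Pearson r = {r:.2f}")
```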