Conventional ASR systems use frame-level phoneme posteriors to perform forced alignment~(FA) and produce timestamps, whereas end-to-end ASR systems, especially AED-based ones, lack this ability. This paper proposes to perform timestamp prediction~(TP) during recognition by using the continuous integrate-and-fire~(CIF) mechanism in the non-autoregressive ASR model Paraformer. Focusing on the fire-place bias of CIF, we apply post-processing strategies including fire-delay and silence insertion. In addition, we propose scaled-CIF to smooth the weights of the CIF output, which proves beneficial for both the ASR and TP tasks. Accumulated averaging shift~(AAS) and diarization error rate~(DER) are adopted to measure timestamp quality, and we compare these metrics between the proposed system and a conventional hybrid forced-alignment system. Experimental results on a test set with manually marked timestamps show that the proposed optimization methods significantly improve the accuracy of CIF timestamps, reducing AAS and DER by 66.7\% and 82.1\%, respectively. Compared with a Kaldi forced-alignment system trained on the same data, the optimized CIF timestamps achieve a 12.3\% relative AAS reduction.
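To make the idea of reading timestamps off CIF concrete, the following is a minimal sketch rather than the Paraformer implementation. It assumes per-frame weights in [0, 1], a firing threshold of 1.0, and a hypothetical frame shift of 60 ms. The helper names (`cif_fire_frames`, `fire_times_ms`, `mean_abs_shift`) are illustrative. The shift measure is a simple mean absolute difference used as a stand-in for AAS, whose exact definition is not given in the abstract.

```python
from typing import List


def cif_fire_frames(alphas: List[float], threshold: float = 1.0) -> List[int]:
    """Accumulate per-frame weights and record the frames where the sum reaches the threshold.

    Assumes each weight lies in [0, 1], so at most one fire happens per frame.
    """
    fires: List[int] = []
    acc = 0.0
    for t, alpha in enumerate(alphas):
        acc += alpha
        if acc >= threshold:
            fires.append(t)
            acc -= threshold  # carry the remainder over to the next token
    return fires


def fire_times_ms(fires: List[int], frame_shift_ms: int = 60) -> List[int]:
    """Convert fire frame indices to milliseconds (the frame shift is an assumed value)."""
    return [t * frame_shift_ms for t in fires]


def mean_abs_shift(pred_ms: List[int], ref_ms: List[int]) -> float:
    """Mean absolute difference between predicted and reference timestamps (stand-in for AAS)."""
    if not pred_ms or len(pred_ms) != len(ref_ms):
        raise ValueError("timestamp lists must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred_ms, ref_ms)) / len(pred_ms)


# Toy weights chosen to be exactly representable in binary floating point.
alphas = [0.5, 0.5, 0.25, 0.25, 0.5, 0.0, 0.5, 0.5]
fires = cif_fire_frames(alphas)
pred = fire_times_ms(fires)
shift = mean_abs_shift(pred, [60, 250, 400])
```

In this toy input the accumulator fires at frames 1, 4, and 7. Post-processing such as fire-delay or silence insertion would adjust these raw fire positions before they are scored against reference timestamps.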