In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA-ASR) in the multi-party meeting scenario, a topic of increasing interest in rich meeting transcription. Specifically, three approaches are evaluated. The first, FD-SOT, consists of a frame-level diarization model to identify speakers and a multi-talker ASR model to recognize utterances; speaker-attributed transcriptions are obtained by aligning the diarization results with the recognized hypotheses. However, this alignment strategy may suffer from erroneous timestamps due to the independence of the two modules, severely hindering performance. We therefore propose the second approach, WD-SOT, which addresses alignment errors by introducing a word-level diarization model, eliminating the dependency on timestamp alignment. To further mitigate alignment issues, we propose the third approach, TS-ASR, which jointly trains a target-speaker separation module and an ASR module. By comparing various strategies for each SA-ASR approach, experimental results on a real meeting corpus, AliMeeting, reveal that WD-SOT achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER) compared with FD-SOT. Moreover, TS-ASR also outperforms FD-SOT, bringing a 16.5% relative reduction in average SD-CER.