Recent speaker diarisation systems often convert variable-length speech segments into fixed-length vector representations, known as speaker embeddings, for speaker clustering. In this paper, the content-aware speaker embeddings (CASE) approach is proposed, which extends the input of the speaker classifier to include not only acoustic features but also the corresponding speech content, via phone, character, and word embeddings. Compared to alternative methods that leverage similar information, such as multitask or adversarial training, CASE factorises automatic speech recognition (ASR) from speaker recognition, allowing the model to focus on speaker characteristics and their correlations with the corresponding content units, and thereby to derive more expressive representations. CASE is evaluated for speaker re-clustering in a realistic speaker diarisation setup using the AMI meeting transcription dataset, where the content information is obtained by performing ASR on an automatic segmentation. Experimental results show that CASE achieves a 17.8% relative speaker error rate reduction over conventional methods.
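The core idea of CASE described above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: all dimensionalities, the mean-pooling step, and the linear projection are assumptions introduced for clarity. It shows how per-frame acoustic features and looked-up content-unit embeddings (e.g. phone embeddings produced from ASR output) can be concatenated and reduced to a fixed-length speaker embedding regardless of segment length.

```python
import numpy as np

# Hypothetical dimensionalities (not specified in the abstract)
ACOUSTIC_DIM = 40   # e.g. filterbank features per frame
CONTENT_DIM = 16    # e.g. phone-embedding size
EMBED_DIM = 32      # fixed speaker-embedding size

def case_embedding(acoustic, content, projection):
    """Concatenate acoustic and content features frame by frame,
    mean-pool over time, and project to a fixed-length embedding.

    acoustic:   (T, ACOUSTIC_DIM) array of per-frame acoustic features
    content:    (T, CONTENT_DIM) array of per-frame content embeddings
    projection: (ACOUSTIC_DIM + CONTENT_DIM, EMBED_DIM) linear map
    """
    joint = np.concatenate([acoustic, content], axis=1)  # (T, A + C)
    pooled = joint.mean(axis=0)                          # (A + C,)
    return pooled @ projection                           # (EMBED_DIM,)

rng = np.random.default_rng(0)
T = 100  # number of frames in one speech segment
acoustic = rng.standard_normal((T, ACOUSTIC_DIM))
content = rng.standard_normal((T, CONTENT_DIM))  # stand-in for phone embeddings
W = rng.standard_normal((ACOUSTIC_DIM + CONTENT_DIM, EMBED_DIM))

emb = case_embedding(acoustic, content, W)
print(emb.shape)  # fixed-length output independent of T
```

In a real system the projection would be a trained speaker classifier and the content embeddings would come from ASR hypotheses; the point of the sketch is simply that variable-length segments map to fixed-length vectors suitable for clustering.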