Speaker clustering is an essential step in conventional speaker diarization systems and is typically addressed as an audio-only speech processing task. The language used by the participants in a conversation, however, carries additional information that can help improve clustering performance. This is especially true in conversational interactions, such as business meetings, interviews, and lectures, where the specific roles assumed by the interlocutors (manager, client, teacher, etc.) are often associated with distinguishable linguistic patterns. In this paper we propose to employ a supervised text-based model to extract speaker roles and then use this information to guide an audio-based spectral clustering step by imposing must-link and cannot-link constraints between segments. The proposed method is applied to two different domains, namely medical interactions and podcast episodes, and is shown to yield improved results compared to the audio-only approach.
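The abstract describes guiding audio-based spectral clustering with pairwise constraints derived from predicted speaker roles. The sketch below is a minimal illustration of that general idea, not the paper's implementation: it assumes a precomputed audio affinity matrix (`audio_affinity`), per-segment role predictions (`role_labels`), and that each role corresponds to a single speaker, so same-role pairs become must-links and different-role pairs become cannot-links before off-the-shelf spectral clustering is run.

```python
# Illustrative sketch of constraint-guided spectral clustering.
# `audio_affinity`, `role_labels`, and the constraint scheme are assumptions,
# not the authors' exact method.
import numpy as np
from sklearn.cluster import SpectralClustering


def constrained_spectral_clustering(audio_affinity, role_labels, n_speakers):
    """Cluster segments from an audio affinity matrix whose entries are
    overwritten by must-link / cannot-link constraints from predicted roles."""
    A = np.array(audio_affinity, dtype=float)
    n = A.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if role_labels[i] is None or role_labels[j] is None:
                continue  # no constraint when a segment's role is unknown
            if role_labels[i] == role_labels[j]:
                A[i, j] = A[j, i] = 1.0  # must-link: same predicted role
            else:
                A[i, j] = A[j, i] = 0.0  # cannot-link: different roles
    clusterer = SpectralClustering(
        n_clusters=n_speakers,
        affinity="precomputed",   # use the constrained matrix directly
        assign_labels="kmeans",
        random_state=0,
    )
    return clusterer.fit_predict(A)


# Toy example: 4 segments from a medical interaction with two roles.
affinity = np.array([[1.0, 0.8, 0.2, 0.1],
                     [0.8, 1.0, 0.3, 0.2],
                     [0.2, 0.3, 1.0, 0.9],
                     [0.1, 0.2, 0.9, 1.0]])
roles = ["doctor", "doctor", "patient", "patient"]
print(constrained_spectral_clustering(affinity, roles, n_speakers=2))
```

In practice the constraints would only be imposed where the text-based role classifier is confident, since hard must-links between segments of speakers who happen to share a role would otherwise merge distinct speakers.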