A good representation of a target speaker usually helps both to extract important information about that speaker and to detect the temporal regions in which the speaker is active in a multi-speaker conversation. In this paper, we propose a neural architecture that simultaneously extracts speaker embeddings consistent with the speaker diarization objective and detects the presence of each speaker frame by frame, regardless of the number of speakers in the conversation. To this end, a residual network (ResNet) and a dual-path recurrent neural network (DPRNN) are integrated into a unified structure. When tested on the two-speaker CALLHOME corpus, our proposed model outperforms most methods published so far. When evaluated in the more challenging case of two to seven concurrent speakers, our system also achieves relative diarization error rate (DER) reductions of 26.35% and 6.4% over two typical baselines, namely the traditional x-vector clustering system and the attention-based system.
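To make the idea of frame-by-frame speaker detection concrete, the following is a minimal sketch of how per-frame, per-speaker activity scores could be turned into speech segments. It is not the paper's implementation: the function name `frames_to_segments`, the input layout `probs[t][s]`, and the 0.5 threshold are all assumptions made for illustration. Each speaker is decoded independently, so the same routine applies to any number of speakers and allows overlapping speech.

```python
from typing import Dict, List, Tuple


def frames_to_segments(
    probs: List[List[float]],
    threshold: float = 0.5,  # assumed decision threshold, not taken from the paper
) -> Dict[int, List[Tuple[int, int]]]:
    """Convert per-frame speaker activity scores into per-speaker segments.

    probs[t][s] is a hypothetical model output: the score that speaker s
    is active at frame t. Segments are returned as (start, end) frame
    indices, with end exclusive.
    """
    num_speakers = len(probs[0]) if probs else 0
    segments: Dict[int, List[Tuple[int, int]]] = {
        s: [] for s in range(num_speakers)
    }

    # Decode each speaker independently so that overlapping speech is allowed.
    for s in range(num_speakers):
        start = None  # frame index where the current active run began
        for t, frame in enumerate(probs):
            active = frame[s] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments[s].append((start, t))
                start = None
        # Close a run that is still open at the end of the recording.
        if start is not None:
            segments[s].append((start, len(probs)))

    return segments


# Two hypothetical speakers over five frames; both are active at frame 1.
example = [
    [0.9, 0.1],
    [0.8, 0.7],
    [0.2, 0.9],
    [0.1, 0.6],
    [0.7, 0.2],
]
result = frames_to_segments(example)
```

Multiplying the returned frame indices by the frame shift of the front end (for example, 10 ms, again an assumption) would give segment boundaries in seconds.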