Recently, hybrid systems combining clustering and neural diarization models have been successfully applied to multi-party meeting analysis. However, current models treat overlapped speaker diarization as a multi-label classification problem, in which speaker dependency and overlaps are not well modeled. To overcome these disadvantages, we reformulate the overlapped speaker diarization task as a single-label prediction problem via the proposed power set encoding (PSE). Through this formulation, speaker dependency and overlaps can be modeled explicitly. To fully leverage this formulation, we further propose the speaker overlap-aware neural diarization (SOND) model, which consists of a context-independent (CI) scorer to model global speaker discriminability, a context-dependent (CD) scorer to model local discriminability, and a speaker combining network (SCN) to combine and reassign speaker activities. Experimental results show that the proposed formulation outperforms state-of-the-art methods based on target speaker voice activity detection, and that performance improves further with SOND, resulting in a 6.30% relative diarization error reduction.
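To make the single-label reformulation concrete, the following is a minimal sketch of one way a power set encoding could work: each frame's binary speaker-activity vector is mapped to one integer that names a subset of speakers. The function names `pse_encode` and `pse_decode` and the bitmask layout are assumptions chosen for illustration; the abstract does not specify how the labels are indexed, and the actual PSE may restrict or order the label set differently.

```python
from typing import List


def pse_encode(activities: List[int]) -> int:
    """Map a binary speaker-activity vector to a single power-set label.

    Assumed layout (illustrative only): speaker i contributes bit i,
    so every subset of speakers gets its own integer label.
    """
    label = 0
    for speaker_index, active in enumerate(activities):
        if active not in (0, 1):
            raise ValueError("activities must contain only 0 or 1")
        label |= active << speaker_index
    return label


def pse_decode(label: int, num_speakers: int) -> List[int]:
    """Recover the binary speaker-activity vector from a power-set label."""
    if not 0 <= label < 2 ** num_speakers:
        raise ValueError("label is out of range for this number of speakers")
    return [(label >> speaker_index) & 1 for speaker_index in range(num_speakers)]


if __name__ == "__main__":
    # Speakers 0 and 2 talk at the same time in this frame.
    frame = [1, 0, 1]
    label = pse_encode(frame)
    print(label, pse_decode(label, num_speakers=3))
```

Under this layout, N speakers yield 2^N classes: label 0 is silence, labels with exactly one bit set are single-speaker frames, and every other label is a specific overlap. A classifier that predicts one such label per frame therefore chooses among speaker combinations directly instead of making N independent binary decisions.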