End-to-end neural diarization (EEND) is nowadays one of the most prominent research topics in speaker diarization. EEND presents an attractive alternative to standard cascaded diarization systems, since a single system is trained at once to deal with the whole diarization problem. Several EEND variants and approaches have been proposed; however, all these models require large amounts of annotated training data, and such annotated data are scarce. Thus, EEND works have mostly relied on simulated mixtures for training. However, simulated mixtures do not resemble real conversations in many aspects. In this work, we present an alternative method for creating synthetic conversations that resemble real ones by using statistics of pause and overlap distributions estimated on genuine conversations. Furthermore, we analyze the effect of the source of the statistics, different augmentations, and the amount of data. We demonstrate that our approach performs substantially better than the original one, while reducing the dependence on the fine-tuning stage. Experiments are carried out on 2-speaker telephone conversations from Callhome and DIHARD 3. Together with this publication, we release our implementations of EEND and of the method for creating simulated conversations.
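To illustrate the core idea of the abstract, the following is a minimal sketch (not the authors' released implementation) of how simulated conversations can be built: pause and overlap durations between consecutive turns are estimated from genuine conversations and then sampled when stitching single-speaker segments together. All function names, the two-speaker setup, and the 8 kHz sample rate are illustrative assumptions.

```python
import numpy as np

def estimate_gap_statistics(reference_turns):
    """Collect gaps between consecutive turns of real conversations.
    Each conversation is a list of (start, end) pairs in seconds;
    positive gaps are pauses, negative gaps are overlaps."""
    gaps = []
    for turns in reference_turns:
        turns = sorted(turns)
        for (_, e1), (s2, _) in zip(turns, turns[1:]):
            gaps.append(s2 - e1)
    return np.array(gaps)

def simulate_conversation(segments_a, segments_b, gaps, sr=8000, rng=None):
    """Interleave single-speaker waveforms of two speakers, drawing each
    pause/overlap between turns from the empirical gap distribution."""
    rng = rng or np.random.default_rng()
    sources = [list(segments_a), list(segments_b)]
    placed, labels, cursor, turn = [], [], 0.0, 0
    while sources[0] or sources[1]:
        spk = turn % 2                                # alternate speakers
        turn += 1
        if not sources[spk]:
            continue
        seg = sources[spk].pop(0)
        gap = float(rng.choice(gaps))                 # sampled pause (>0) or overlap (<0)
        start = max(0.0, cursor + gap)
        placed.append((int(start * sr), seg))
        labels.append((spk, start, start + len(seg) / sr))
        cursor = start + len(seg) / sr
    mix = np.zeros(max(s + len(seg) for s, seg in placed))
    for s, seg in placed:
        mix[s:s + len(seg)] += seg                    # overlapping regions simply add up
    return mix, labels
```

The returned `labels` (speaker, start, end) can serve as diarization references for training, while `mix` is the corresponding single-channel waveform; the actual released tools may differ in structure and options.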