End-to-end diarization presents an attractive alternative to standard cascaded diarization systems because a single system can handle all aspects of the task at once. Many flavors of end-to-end models have been proposed, but all of them require large amounts of annotated training data, which so far do not exist. The compromise solution is to generate synthetic data, and the recently proposed simulated conversations (SC) have shown remarkable improvements over the original simulated mixtures (SM). In this work, we create SC with multiple speakers per conversation and show that they allow for substantially better performance than SM, while also reducing the dependence on a fine-tuning stage. We also create SC from wide-band public audio sources and present an analysis on several evaluation sets. Together with this publication, we release the recipes for generating such data and models trained on public sets, as well as the implementation to efficiently handle multiple speakers per conversation and an auxiliary voice activity detection loss.