This paper investigates a method for simulating natural conversation in the model training of end-to-end neural diarization (EEND). Due to the lack of any annotated real conversational dataset, EEND is usually pretrained on a large-scale simulated conversational dataset first and then adapted to the target real dataset. Simulated datasets play an essential role in the training of EEND, but as yet there has been insufficient investigation into an optimal simulation method. We thus propose a method to simulate natural conversational speech. In contrast to conventional methods, which simply combine the speech of multiple speakers, our method takes turn-taking into account. We define four types of speaker transition and sequentially arrange them to simulate natural conversations. The dataset simulated using our method was found to be statistically similar to the real dataset in terms of the silence and overlap ratios. The experimental results on two-speaker diarization using the CALLHOME and CSJ datasets showed that the simulated dataset contributes to improving the performance of EEND.
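The turn-taking idea in the abstract can be illustrated with a minimal sketch. The abstract does not name the four transition types or give their statistics, so the labels used here (turn-hold, turn-switch, interruption, backchannel), the probabilities, the duration ranges, and the definitions of the silence and overlap ratios are all illustrative assumptions, not the authors' actual procedure. The sketch only produces a two-speaker timeline of segment boundaries in 10 ms frames; mixing real utterance audio onto that timeline is out of scope.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: int  # 0 or 1
    start: int    # start frame (10 ms units)
    end: int      # end frame, exclusive


# Hypothetical transition types and sampling probabilities (assumed, not from the paper).
TRANSITIONS = {
    "turn_hold": 0.4,     # same speaker continues after a pause
    "turn_switch": 0.4,   # other speaker starts after a pause
    "interruption": 0.1,  # other speaker starts before the current one finishes
    "backchannel": 0.1,   # short overlapped response; the floor does not change
}


def simulate(n_utterances, seed=0):
    """Arrange utterances sequentially by sampling one transition per step."""
    rng = random.Random(seed)
    names = list(TRANSITIONS)
    weights = list(TRANSITIONS.values())

    floor = Segment(0, 0, rng.randint(100, 500))  # utterance holding the floor
    segments = [floor]
    for _ in range(n_utterances - 1):
        kind = rng.choices(names, weights=weights)[0]
        other = 1 - floor.speaker
        if kind == "backchannel":
            dur = rng.randint(20, 80)
            # Place the short response fully inside the floor holder's utterance.
            start = rng.randint(floor.start, floor.end - dur)
            segments.append(Segment(other, start, start + dur))
            continue
        dur = rng.randint(100, 500)
        if kind == "interruption":
            start = floor.end - rng.randint(10, 100)  # overlap with the tail
            speaker = other
        else:
            start = floor.end + rng.randint(10, 100)  # pause before next utterance
            speaker = floor.speaker if kind == "turn_hold" else other
        floor = Segment(speaker, start, start + dur)
        segments.append(floor)
    return segments


def silence_overlap_ratios(segments):
    """Fraction of frames with no active speaker and with both speakers active."""
    total = max(s.end for s in segments)
    active = [[False] * total for _ in range(2)]
    for s in segments:
        for t in range(s.start, s.end):
            active[s.speaker][t] = True
    silence = sum(1 for a, b in zip(*active) if not a and not b)
    overlap = sum(1 for a, b in zip(*active) if a and b)
    return silence / total, overlap / total


if __name__ == "__main__":
    segs = simulate(50, seed=1)
    sil, ovl = silence_overlap_ratios(segs)
    print(f"silence ratio: {sil:.3f}, overlap ratio: {ovl:.3f}")
```

In a sketch like this, the transition probabilities and the pause and overlap ranges are the knobs one would fit to a real corpus, so that the simulated silence and overlap ratios approach those measured on the target data.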