We propose a new end-to-end neural diarization (EEND) system based on Conformer, a recently proposed neural architecture that combines convolution and self-attention to model both local and global dependencies in speech. We first show that data augmentation and convolutional subsampling layers enhance the original self-attentive EEND (SA-EEND), and that Conformer then gives an additional gain over the Transformer-based EEND. However, we notice that the Conformer-based EEND does not generalize as well from simulated to real conversation data as the Transformer-based model. This leads us to quantify the mismatch between simulated data and real speaker behavior in terms of temporal statistics reflecting turn-taking between speakers, and to investigate its correlation with diarization error. By mixing simulated and real data in EEND training, we mitigate the mismatch further, with the Conformer-based EEND achieving a 24% error reduction over the baseline SA-EEND system and a 10% improvement over the best augmented Transformer-based system on two-speaker CALLHOME data.
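To illustrate the convolutional subsampling mentioned above: such front ends typically stack two stride-2 convolutions, reducing the frame rate by roughly 4x before the self-attention layers operate on the sequence. The sketch below is a minimal NumPy illustration with hypothetical feature dimensions and random weights; it is not the paper's implementation.

```python
import numpy as np

def conv1d_subsample(x, kernel, stride=2):
    """Strided 1-D convolution along the time axis (no padding), plus ReLU.

    x: (T, D) frame sequence; kernel: (K, D, D_out) weights.
    Returns a (T', D_out) sequence with T' = (T - K) // stride + 1.
    """
    T, D = x.shape
    K, _, D_out = kernel.shape
    T_out = (T - K) // stride + 1
    y = np.empty((T_out, D_out))
    for t in range(T_out):
        window = x[t * stride : t * stride + K]        # (K, D) local context
        y[t] = np.einsum("kd,kdo->o", window, kernel)  # convolve and project
    return np.maximum(y, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 23))           # 100 frames of 23-dim features (hypothetical)
w1 = rng.standard_normal((3, 23, 64)) * 0.1  # first stride-2 conv
w2 = rng.standard_normal((3, 64, 64)) * 0.1  # second stride-2 conv
h = conv1d_subsample(conv1d_subsample(x, w1), w2)
print(h.shape)  # (24, 64): roughly a 4x reduction in frame count
```

Each stride-2 layer halves the number of frames, so the attention layers that follow see about a quarter of the original sequence length, which cuts their quadratic cost considerably.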