In this paper, we present a novel framework that jointly performs three tasks: speaker diarization, speech separation, and speaker counting. The framework integrates speaker diarization based on end-to-end neural diarization (EEND) models, speaker counting with encoder-decoder-based attractors (EDA), and speech separation using Conv-TasNet. To improve the joint framework, we further propose two components: a multiple 1x1 convolutional layer architecture that estimates separation masks for a flexible number of speakers, and a fusion technique that refines the separated speech signals using the obtained speaker diarization information. Experiments on the LibriMix dataset show that the proposed method outperforms the single-task baselines in both diarization and separation metrics for fixed and flexible numbers of speakers, and improves speaker counting performance for flexible numbers of speakers. All materials will be open-sourced and reproducible in the ESPnet toolkit.
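As a rough illustration of the multiple 1x1 convolutional layer idea mentioned in the abstract, the sketch below keeps one 1x1 convolution head per possible speaker count and selects the head that matches an estimated count. The abstract does not give the layer details, so everything here is an assumption: the class name `MultiHeadMaskEstimator`, the NumPy implementation (a 1x1 convolution over time is written as a per-frame matrix product), the sigmoid mask activation, the random untrained weights, and the feature shapes. It is not the authors' ESPnet implementation.

```python
import numpy as np


class MultiHeadMaskEstimator:
    """Hypothetical sketch: one 1x1-conv head per speaker count.

    Head k (1-based) maps a (feat_dim, T) feature map to k masks,
    each of shape (feat_dim, T).
    """

    def __init__(self, feat_dim, max_speakers, seed=0):
        rng = np.random.default_rng(seed)
        self.feat_dim = feat_dim
        self.max_speakers = max_speakers
        # A 1x1 convolution over time is a linear map across channels
        # applied independently to every frame, so each head is just a
        # weight matrix of shape (k * feat_dim, feat_dim) plus a bias.
        self.weights = [
            0.1 * rng.standard_normal((k * feat_dim, feat_dim))
            for k in range(1, max_speakers + 1)
        ]
        self.biases = [np.zeros(k * feat_dim) for k in range(1, max_speakers + 1)]

    def __call__(self, features, num_speakers):
        """Return masks of shape (num_speakers, feat_dim, T)."""
        if not 1 <= num_speakers <= self.max_speakers:
            raise ValueError("num_speakers is outside the supported range")
        w = self.weights[num_speakers - 1]
        b = self.biases[num_speakers - 1]
        logits = w @ features + b[:, None]      # (k * feat_dim, T)
        masks = 1.0 / (1.0 + np.exp(-logits))   # sigmoid is an assumption
        return masks.reshape(num_speakers, self.feat_dim, -1)


# Usage: pretend a speaker-counting module estimated 3 speakers.
estimator = MultiHeadMaskEstimator(feat_dim=64, max_speakers=5)
features = np.random.default_rng(1).standard_normal((64, 100))
masks = estimator(features, num_speakers=3)
masked = masks * features[None, :, :]  # one masked feature map per speaker
print(masks.shape, masked.shape)
```

In a joint system along the lines the abstract describes, `num_speakers` would come from the speaker-counting branch at inference time, so a single model can serve mixtures with different numbers of speakers.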