We present an end-to-end deep neural network model that performs meeting diarization from single-channel audio recordings. Unlike traditional clustering-based diarization methods, end-to-end models can handle overlapping speech and permit straightforward discriminative training. The proposed system is designed for meetings with an unknown number of speakers, using loss functions based on variable-number permutation-invariant cross-entropy. We introduce several components that appear to improve diarization performance, including a local convolutional network followed by a global self-attention module, multi-task transfer learning using a speaker-identification component, and a sequential approach in which the model output is refined by a second stage. The models are trained and validated on simulated meeting data derived from the LibriSpeech and LibriTTS corpora; final evaluation uses LibriCSS, which consists of simulated meetings re-recorded with real acoustics via loudspeaker playback. The proposed model outperforms previously proposed end-to-end diarization models on these data.
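The abstract refers to permutation-invariant cross-entropy losses for training. As a rough illustration of the underlying idea (not the paper's exact variable-speaker-count formulation), here is a minimal NumPy sketch of permutation-invariant training with a binary cross-entropy objective over a fixed number of speakers: the loss is evaluated under every assignment of output channels to reference speakers, and the minimum is taken, since the ordering of speakers in the reference is arbitrary.

```python
import itertools
import numpy as np

def pit_bce_loss(probs, labels, eps=1e-8):
    """Permutation-invariant binary cross-entropy (illustrative sketch).

    probs:  (T, S) predicted per-frame speaker-activity probabilities
    labels: (T, S) binary reference speaker activities
    Returns the minimum mean BCE over all S! speaker permutations.
    """
    _, num_speakers = probs.shape
    best = np.inf
    for perm in itertools.permutations(range(num_speakers)):
        p = probs[:, perm]  # reorder predicted channels to match this assignment
        bce = -(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
        best = min(best, bce.mean())
    return best

# Toy example: predictions match the reference up to a channel swap,
# so the permutation-invariant loss is small even though the identity
# channel assignment would score poorly.
labels = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
probs = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]])
loss = pit_bce_loss(probs, labels)
```

In practice the brute-force search over S! permutations is only feasible for small speaker counts; published end-to-end diarization systems use more efficient matching or, as here, clustering-style extensions to handle variable and larger numbers of speakers.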