Attractor-based end-to-end diarization achieves accuracy comparable to that of carefully tuned conventional clustering-based methods on challenging datasets. Its main drawback, however, is that it cannot handle recordings in which the number of speakers exceeds the number observed during training, because its speaker counting relies on supervised learning. In this work, we introduce an unsupervised clustering process embedded in attractor-based end-to-end diarization. We first split a sequence of frame-wise embeddings into short subsequences and then perform attractor-based diarization on each subsequence. Given the subsequence-wise diarization results, inter-subsequence speaker correspondence is obtained by unsupervised clustering of vectors computed from the attractors of all the subsequences. This makes it possible to produce diarization results for a large number of speakers over the whole recording even if the number of output speakers for each subsequence is limited. Experimental results show that our method produces accurate diarization results for numbers of speakers unseen during training. Our method achieved diarization error rates of 11.84%, 28.33%, and 19.49% on the CALLHOME, DIHARD II, and DIHARD III datasets, respectively, each of which is better than those of conventional end-to-end diarization methods.
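The split, local diarization, and global clustering steps described above can be illustrated with a minimal sketch. It is not the proposed model: `local_diarize` is a hypothetical stand-in that replaces the neural attractor-based diarizer with a greedy leader clustering and treats each cluster mean as an "attractor", and the same leader clustering is swapped in for the unsupervised clustering of attractor-derived vectors. Overlapping speech is ignored, and all function names, the chunk length, and the distance threshold are assumptions made for illustration.

```python
from math import dist


def split_into_subsequences(frames, length):
    """Cut a list of frame-wise embeddings into consecutive chunks."""
    return [frames[i:i + length] for i in range(0, len(frames), length)]


def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def leader_cluster(vectors, threshold, max_clusters=None):
    """Greedy leader clustering (illustrative only).

    Each vector joins the nearest centre if it lies within `threshold`;
    otherwise it opens a new cluster, unless `max_clusters` is reached,
    in which case it joins the nearest centre anyway.
    """
    centres, members, labels = [], [], []
    for v in vectors:
        dists = [dist(v, c) for c in centres]
        best = min(range(len(centres)), key=dists.__getitem__, default=None)
        full = max_clusters is not None and len(centres) >= max_clusters
        if best is None or (dists[best] > threshold and not full):
            centres.append(list(v))
            members.append([v])
            labels.append(len(centres) - 1)
        else:
            members[best].append(v)
            centres[best] = mean(members[best])
            labels.append(best)
    return labels, centres


def local_diarize(chunk, max_speakers, threshold):
    """Hypothetical stand-in for the attractor-based model.

    Returns per-frame local speaker indices and one 'attractor'
    (here simply a cluster mean) per local speaker.
    """
    return leader_cluster(chunk, threshold, max_clusters=max_speakers)


def diarize(frames, chunk_len=4, max_local_speakers=2, threshold=1.0):
    """Local diarization per chunk, then global clustering of attractors."""
    all_attractors, local_results = [], []
    for chunk in split_into_subsequences(frames, chunk_len):
        labels, attractors = local_diarize(chunk, max_local_speakers, threshold)
        # Remember where this chunk's attractors start in the global list.
        local_results.append((labels, len(all_attractors)))
        all_attractors.extend(attractors)
    # No cap on the number of global clusters: this is what allows more
    # speakers overall than any single chunk can output.
    global_ids, centres = leader_cluster(all_attractors, threshold)
    output = []
    for labels, offset in local_results:
        output.extend(global_ids[offset + local] for local in labels)
    return output, len(centres)


# Three toy "speakers" near (0, 0), (5, 0) and (0, 5); each chunk of four
# frames contains only two of them.
frames = [
    (0, 0), (0.1, 0), (5, 0), (5, 0.1),
    (5, 0), (0, 5), (0, 5.1), (5.1, 0),
    (0, 0), (0, 5), (0.1, 0), (0.1, 5),
]
speaker_ids, num_speakers = diarize(frames)
print(num_speakers)  # 3
print(speaker_ids)   # [0, 0, 1, 1, 1, 2, 2, 1, 0, 2, 0, 2]
```

Although each chunk is limited to two local speakers, the global step recovers three, which mirrors the idea that the per-subsequence output limit need not bound the number of speakers in the whole recording.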