Online Neural Diarization of Unlimited Numbers of Speakers

This paper describes a method for offline and online speaker diarization of an unlimited number of speakers. End-to-end neural diarization (EEND) achieves overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended to handle a flexible number of speakers by introducing speaker-wise attractors. However, the number of speakers that attractor-based EEND can output is empirically capped: because its speaker counting is trained in a fully supervised manner, it cannot handle cases where more speakers appear during inference than were seen during training. Our method, EEND-GLA, solves this problem by introducing unsupervised clustering into attractor-based EEND. The input audio is first divided into short blocks, attractor-based diarization is then performed on each block, and finally the results of all blocks are clustered on the basis of the similarity between the locally calculated attractors. While the number of output speakers is limited within each block, the total number of speakers estimated for the entire input can exceed this limit. To use EEND-GLA in an online manner, our method also extends the speaker-tracing buffer, which was originally proposed to enable online inference with conventional EEND. We introduce a block-wise buffer update to make the speaker-tracing buffer compatible with EEND-GLA. Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND. The experimental results demonstrate that EEND-GLA can perform speaker diarization for an unseen number of speakers in both offline and online inference.
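The block-wise idea above (a capped number of attractors per block, yet an uncapped global speaker count) can be illustrated with a minimal sketch. This is not the EEND-GLA implementation: the function name `cluster_local_attractors`, the cosine-similarity `threshold`, and the greedy matching rule are all assumptions made for illustration, and the abstract does not specify the clustering algorithm. The sketch assumes each block has already produced a small array of non-zero attractor vectors, and it never merges two attractors from the same block, since they represent different speakers.

```python
import numpy as np


def cluster_local_attractors(blocks, threshold=0.5):
    """Greedily assign a global speaker id to every local attractor.

    blocks: list of arrays, each of shape (n_local_speakers, dim), assumed non-zero.
    Returns (labels, n_global), where labels[i][j] is the global id of
    attractor j in block i. Hypothetical helper, for illustration only.
    """
    centroids = []  # running sum of unit-norm attractors per global speaker
    labels = []
    for attractors in blocks:
        unit = attractors / np.linalg.norm(attractors, axis=1, keepdims=True)
        block_labels = [-1] * len(unit)
        used = set()  # global ids already taken within this block
        if centroids:
            cents = np.stack(centroids)
            cents = cents / np.linalg.norm(cents, axis=1, keepdims=True)
            sim = unit @ cents.T  # cosine similarity, shape (n_local, n_global)
            # Visit (local, global) pairs from most to least similar.
            for idx in np.argsort(-sim, axis=None):
                j, g = divmod(int(idx), sim.shape[1])
                if sim[j, g] < threshold:
                    break
                if block_labels[j] == -1 and g not in used:
                    block_labels[j] = g
                    used.add(g)
        for j in range(len(unit)):
            if block_labels[j] == -1:
                # No sufficiently similar known speaker: open a new one.
                block_labels[j] = len(centroids)
                centroids.append(unit[j].copy())
            else:
                centroids[block_labels[j]] = centroids[block_labels[j]] + unit[j]
        labels.append(block_labels)
    return labels, len(centroids)


# Toy input: three speakers overall, but only two attractors per block.
e = np.eye(3)
labels, n_global = cluster_local_attractors([e[[0, 1]], e[[1, 2]], e[[2, 0]]])
print(labels, n_global)  # [[0, 1], [1, 2], [2, 0]] 3
```

In the toy input, no block contains more than two speakers, yet three global speakers are recovered, which mirrors the claim that the total can exceed the per-block limit. A real system would need a more robust clustering step than this greedy threshold rule.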
