This paper proposes a novel online speaker diarization algorithm based on fully supervised self-attentive end-to-end neural diarization (SA-EEND). Online diarization inherently suffers from a speaker-permutation problem, since speaker regions may be assigned inconsistently across a recording. To resolve this inconsistency, we propose a speaker-tracing buffer mechanism that selects input frames carrying the speaker-permutation information of previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames of the current chunk and fed into the self-attention network. Our method keeps the diarization outputs consistent between the buffer and the current chunk by checking the correlation between their corresponding outputs. In addition, we train SA-EEND with variable chunk sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer. Combining online SA-EEND with variable chunk-size training achieved DERs of 12.54% on CALLHOME and 20.77% on CSJ with an actual latency of 1.4 s.
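The buffer-to-chunk consistency check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the posterior arrays, their shapes, and the exhaustive search over speaker permutations are all assumptions. The idea is that the buffered frames are re-scored inside the current chunk, and the column permutation of the current chunk's output that best correlates with the stored outputs for those same frames is applied, keeping speaker labels consistent across chunks.

```python
from itertools import permutations

import numpy as np


def resolve_permutation(buf_probs: np.ndarray, cur_buf_probs: np.ndarray) -> tuple:
    """Pick the speaker permutation of the current chunk's output that best
    correlates with the outputs stored in the speaker-tracing buffer.

    buf_probs:     (T_buf, S) speaker posteriors stored in the buffer
                   (computed when these frames belonged to a previous chunk)
    cur_buf_probs: (T_buf, S) posteriors for the same buffered frames,
                   re-computed as part of the current chunk
    Returns the column permutation to apply to the current chunk's output.
    """
    n_speakers = buf_probs.shape[1]
    best_perm, best_corr = None, -np.inf
    for perm in permutations(range(n_speakers)):
        # Correlation between stored outputs and the permuted current outputs
        corr = float(np.sum(buf_probs * cur_buf_probs[:, perm]))
        if corr > best_corr:
            best_corr, best_perm = corr, perm
    return best_perm


# Toy example: the current chunk labeled the two speakers in swapped order,
# so the correlation check recovers the swap (1, 0).
stored = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
current = stored[:, [1, 0]]  # same frames, speaker columns swapped
print(resolve_permutation(stored, current))  # -> (1, 0)
```

The exhaustive permutation search is only practical for the small speaker counts typical of EEND-style models; the abstract itself does not specify how the permutation is selected beyond checking output correlation.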