End-to-end neural diarization (EEND) with self-attention directly predicts speaker labels from inputs and enables the handling of overlapped speech. Although EEND outperforms clustering-based speaker diarization (SD), it cannot be improved simply by increasing the number of encoder blocks, because the last encoder block receives dominant supervision compared with the lower blocks. This paper proposes a new residual auxiliary EEND (RX-EEND) learning architecture for transformers that forces the lower encoder blocks to learn more accurately. An auxiliary loss is applied to the output of each encoder block, including the last one. The effect of the auxiliary loss on the learning of the encoder blocks is further increased by adding residual connections between the encoder blocks of the EEND. Performance evaluation and an ablation study reveal that the auxiliary loss in the proposed RX-EEND provides relative reductions in the diarization error rate (DER) of 50.3% and 21.0% on the simulated and CALLHOME (CH) datasets, respectively, compared with self-attentive EEND (SA-EEND). Furthermore, the residual connection used in RX-EEND reduces the DER by a further relative 8.1% on the CH dataset.
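The two architectural ideas in the abstract (a per-block auxiliary loss and residual connections between encoder blocks) can be illustrated with a minimal PyTorch sketch. All module and function names below are illustrative assumptions, not the authors' code; the real model uses a permutation-invariant BCE loss, which is simplified here to plain BCE for brevity.

```python
import torch
import torch.nn as nn


class RXEENDSketch(nn.Module):
    """Minimal sketch of the RX-EEND idea: a stack of self-attention
    encoder blocks with (i) a residual connection between consecutive
    blocks and (ii) an auxiliary speaker-activity head after every
    block, so lower blocks receive direct supervision."""

    def __init__(self, d_model=256, n_heads=4, n_blocks=4, n_speakers=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_blocks)]
        )
        # One linear head per block maps frame embeddings to
        # per-speaker activity logits (sigmoid is applied in the loss).
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, n_speakers) for _ in range(n_blocks)]
        )

    def forward(self, x):
        # x: (batch, frames, d_model) frame-level acoustic embeddings.
        logits_per_block = []
        for block, head in zip(self.blocks, self.heads):
            # Residual connection between encoder blocks: the block's
            # input is added to its output before the next block.
            x = x + block(x)
            logits_per_block.append(head(x))
        return logits_per_block  # last entry is the main output


def total_loss(logits_per_block, labels):
    """Auxiliary loss: sum the diarization loss over the outputs of
    every encoder block, including the last one (plain BCE stands in
    for the permutation-invariant loss used in practice)."""
    bce = nn.BCEWithLogitsLoss()
    return sum(bce(logits, labels) for logits in logits_per_block)
```

Under this sketch, training minimizes the summed loss, while inference would use only the final block's logits; removing the `x +` term or keeping only the last head's loss recovers a plain SA-EEND-style stack.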