Target-speaker voice activity detection is currently a promising approach for speaker diarization in complex acoustic environments. This paper presents a novel Sequence-to-Sequence Target-Speaker Voice Activity Detection (Seq2Seq-TSVAD) method that efficiently handles the joint modeling of a large number of speakers and predicts high-resolution voice activities. Experimental results show that larger speaker capacity and higher output resolution significantly reduce the diarization error rate (DER). The method achieves new state-of-the-art DERs of 4.55% on the VoxConverse test set and 10.77% on Track 1 of the DIHARD-III evaluation set under the widely used evaluation metrics.
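For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The following is a minimal frame-level sketch of that definition, not the scoring procedure used in the paper: it assumes the hypothesis speakers have already been mapped to the reference speakers (rows aligned), applies no forgiveness collar, and the function name `frame_level_der` and the toy labels are hypothetical.

```python
def frame_level_der(reference, hypothesis):
    """Simplified frame-level DER.

    reference, hypothesis: lists with one row per speaker, each row a list
    of 0/1 activity labels per frame. Rows are assumed to be already aligned
    (speaker i in the hypothesis corresponds to speaker i in the reference).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis need the same number of speakers")
    num_frames = len(reference[0])
    if any(len(row) != num_frames for row in reference + hypothesis):
        raise ValueError("all rows need the same number of frames")

    missed = false_alarm = confusion = total_ref = 0
    for t in range(num_frames):
        n_ref = sum(row[t] for row in reference)    # active reference speakers
        n_hyp = sum(row[t] for row in hypothesis)   # active hypothesis speakers
        n_correct = sum(r[t] * h[t] for r, h in zip(reference, hypothesis))
        missed += max(0, n_ref - n_hyp)
        false_alarm += max(0, n_hyp - n_ref)
        confusion += min(n_ref, n_hyp) - n_correct
        total_ref += n_ref

    if total_ref == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total_ref


# Toy example (made-up labels): two speakers, six frames.
reference = [[1, 1, 1, 0, 0, 0],
             [0, 0, 1, 1, 1, 0]]
hypothesis = [[1, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 1]]
der = frame_level_der(reference, hypothesis)
```

In this toy case two speaker-frames are missed and one is a false alarm, against six reference speaker-frames. Finer output resolution matters for a metric of this kind because each frame decision covers less time, so boundary errors cost less.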