Speaker segmentation consists of partitioning a conversation involving one or more speakers into speaker turns. The task is usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped speech detection); we instead propose to train an end-to-end segmentation model that performs it directly. Inspired by the original end-to-end neural speaker diarization approach (EEND), we model the task as a multi-label classification problem and train with permutation-invariant training. The main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (one output every 16 ms). Experiments on multiple speaker diarization datasets show that our model can be used with great success for both voice activity detection and overlapped speech detection. The proposed model can also be used as a post-processing step to detect and correctly assign overlapped speech regions. The relative diarization error rate improvement over the best considered baseline (VBx) reaches 18% on AMI, 17% on DIHARD 3, and 16% on VoxConverse.
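To make the multi-label, permutation-invariant formulation concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes per-frame speaker activity probabilities, binary cross-entropy as the per-track loss, a brute-force search over speaker permutations, and a 0.5 decision threshold; the names `bce`, `pit_loss`, and `detect` are hypothetical. The frame count is a rough estimate that ignores any receptive-field effects of a real network.

```python
import itertools
import math

CHUNK_MS = 5000       # chunk duration stated in the abstract (5 seconds)
FRAME_STEP_MS = 16    # temporal resolution stated in the abstract (16 ms)
FRAMES_PER_CHUNK = CHUNK_MS // FRAME_STEP_MS  # rough estimate of outputs per chunk


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two equally long per-frame lists."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def pit_loss(preds, targets):
    """Permutation-invariant loss (assumed here to be BCE-based).

    preds and targets are lists of speaker tracks; each track is a list of
    per-frame values (probabilities for preds, 0/1 labels for targets).
    Returns (best_loss, best_perm), where best_perm[i] is the index of the
    predicted track matched to reference speaker i.
    """
    n = len(targets)
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n)):
        loss = sum(bce(preds[perm[i]], targets[i]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def detect(preds, threshold=0.5):
    """Derive per-frame voice activity and overlap flags from speaker tracks.

    Assumption: speech is present when at least one track exceeds the
    threshold, and overlap is present when at least two tracks do.
    """
    vad, overlap = [], []
    for frame in zip(*preds):
        active = sum(p > threshold for p in frame)
        vad.append(active >= 1)
        overlap.append(active >= 2)
    return vad, overlap


# Toy example: 2 speakers, 4 frames, both speakers active in the second frame.
targets = [[1, 1, 0, 0], [0, 1, 1, 0]]
# Predictions carry the same information but with the two tracks swapped.
preds = [[0.1, 0.9, 0.8, 0.2], [0.9, 0.8, 0.1, 0.1]]

loss, perm = pit_loss(preds, targets)
vad, overlap = detect(preds)
```

Because the loss is minimized over permutations, the model is not penalized for emitting speakers in a different order than the reference, which is what allows training on chunks where speaker identity has no fixed output index. The brute-force search is only practical for a small number of speakers per chunk.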