This paper addresses speaker verification on audio with multiple overlapping speakers. Most speaker verification systems are designed under the assumption that a single speaker is present in a given audio segment. In real-world settings, however, this assumption does not always hold. We demonstrate that current speaker verification systems are not robust against audio with noticeable speaker overlap. To alleviate this issue, we propose margin-mixup, a simple training strategy that can easily be adopted by existing speaker verification pipelines to make the resulting speaker embeddings robust against multi-speaker audio. In contrast to other methods, margin-mixup requires no alterations to regular speaker verification architectures while attaining better results. On our multi-speaker test set based on VoxCeleb1, the proposed margin-mixup strategy reduces the EER by 44.4% relative, on average, compared with our state-of-the-art speaker verification baseline systems.
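To make the "relative" in the reported EER figure concrete, the following minimal sketch shows how a relative EER reduction is conventionally computed. The function name and the example EER values are hypothetical and chosen only for illustration; they are not results from this paper.

```python
def relative_eer_reduction(baseline_eer: float, improved_eer: float) -> float:
    """Return the relative EER reduction in percent.

    Both arguments are EER values in percent. The numbers passed in below
    are illustrative assumptions, not figures reported by the paper.
    """
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    # Relative reduction: the absolute drop divided by the baseline value.
    return (baseline_eer - improved_eer) / baseline_eer * 100.0


# Hypothetical example: a baseline EER of 10.00% falling to 5.56%.
reduction = relative_eer_reduction(10.0, 5.56)
print(f"relative EER reduction: {reduction:.1f}%")
```

A relative reduction of this size means the absolute EER drops by a little under half of its baseline value, which is distinct from an absolute drop of 44.4 percentage points.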