Modern speaker verification systems primarily rely on speaker embeddings and cosine similarity. While effective, these methods struggle with multi-talker speech due to the unidentifiability of embedding vectors. We propose Neural Scoring (NS), a novel end-to-end framework that directly estimates verification posterior probabilities without relying on test-side embeddings, which makes it more powerful and more robust in complex conditions such as multi-talker speech. To address the challenge of training such end-to-end models, we introduce a multi-enrollment training strategy that pairs each test utterance with multiple enrolled speakers and proves essential to the model's success. Experiments on the VoxCeleb dataset demonstrate that NS consistently outperforms both the baseline and several competitive methods, achieving an overall 70.36% reduction in Equal Error Rate (EER) compared to the baseline.
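
For readers unfamiliar with the conventional pipeline mentioned above, the following is a minimal sketch of embedding-based cosine scoring and of how an Equal Error Rate can be estimated from a set of trial scores. It is not the Neural Scoring model. The function names `cosine_score` and `equal_error_rate`, the threshold sweep over observed scores, and the toy inputs are illustrative assumptions rather than anything specified in the abstract.

```python
import numpy as np


def cosine_score(enroll_emb, test_emb):
    """Cosine similarity between an enrollment and a test embedding."""
    enroll_emb = np.asarray(enroll_emb, dtype=float)
    test_emb = np.asarray(test_emb, dtype=float)
    denom = np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb)
    return float(enroll_emb @ test_emb / denom)


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    labels: 1 (or True) for target trials, 0 (or False) for non-target trials.
    A trial is accepted when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)
    # False acceptance rate: non-target trials that are accepted.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    # False rejection rate: target trials that are rejected.
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


if __name__ == "__main__":
    # Toy trial list (made up for illustration): two target, two non-target.
    scores = [0.9, 0.4, 0.6, 0.1]
    labels = [1, 1, 0, 0]
    print("EER:", equal_error_rate(scores, labels))
```

In this sketch each trial reduces to a single similarity between two fixed vectors, which is the step NS replaces by estimating the verification posterior directly from the test speech.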