Modern speaker verification systems primarily rely on speaker embeddings, followed by verification based on the cosine similarity between the embedding vectors of the enrollment and test utterances. While effective, these methods struggle with multi-talker speech because the embedding vectors become unidentifiable. In this paper, we propose Neural Scoring (NS), a refreshed end-to-end framework that directly estimates verification posterior probabilities without relying on test-side embeddings, making it more robust in complex conditions such as multi-talker speech. To train such an end-to-end model efficiently, we introduce a large-scale trial E2E training (LtE2E) strategy, in which each test utterance is paired with a set of enrolled speakers, enabling a large number of verification trials to be processed per batch. Experiments on the VoxCeleb dataset demonstrate that NS consistently outperforms both the baseline and competitive methods across various conditions, achieving an overall 70.36% reduction in EER compared to the baseline.
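As a rough illustration of the LtE2E batching idea only (not the paper's actual architecture or implementation), the sketch below shows how a single batch could pair every test utterance with a set of enrolled speakers, yielding B × N verification trials per batch that are scored by a neural model producing posteriors directly; all module and variable names here are hypothetical.

```python
# Hypothetical sketch of large-scale trial E2E (LtE2E) batch scoring.
# Names (ScoringHead, score_batch, ...) are illustrative, not from the paper.
import torch
import torch.nn as nn

class ScoringHead(nn.Module):
    """Toy stand-in for a neural scoring model: maps an (enrollment
    embedding, test representation) pair to a verification posterior."""
    def __init__(self, enroll_dim: int, test_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enroll_dim + test_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, enroll: torch.Tensor, test: torch.Tensor) -> torch.Tensor:
        # enroll: (num_trials, enroll_dim), test: (num_trials, test_dim)
        return torch.sigmoid(self.net(torch.cat([enroll, test], dim=-1))).squeeze(-1)

def score_batch(model, enroll_emb, test_repr):
    """Pair every test utterance with every enrolled speaker, producing
    B * N trials per batch (B test utterances, N enrolled speakers)."""
    B, N = test_repr.size(0), enroll_emb.size(0)
    enroll = enroll_emb.unsqueeze(0).expand(B, N, -1).reshape(B * N, -1)
    test = test_repr.unsqueeze(1).expand(B, N, -1).reshape(B * N, -1)
    return model(enroll, test).view(B, N)  # posterior for each (test, speaker) trial

if __name__ == "__main__":
    B, N, D_E, D_T = 4, 8, 192, 256               # 4 test utterances vs. 8 enrolled speakers
    model = ScoringHead(D_E, D_T)
    enroll_emb = torch.randn(N, D_E)              # one embedding per enrolled speaker
    test_repr = torch.randn(B, D_T)               # test-side representations (no test embedding assumed)
    labels = torch.randint(0, 2, (B, N)).float()  # 1 = target trial, 0 = non-target
    posteriors = score_batch(model, enroll_emb, test_repr)
    loss = nn.functional.binary_cross_entropy(posteriors, labels)
    loss.backward()
    print(posteriors.shape, loss.item())          # torch.Size([4, 8]) and a scalar loss
```

The key point of the sketch is the trial layout: a single forward pass scores all B × N (test, enrolled-speaker) pairs with binary target/non-target labels, rather than one enrollment-test pair at a time.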