Speaker verification judges the similarity of two unknown voices in an open-set setting, where the ideal speaker embedding condenses discriminative information into a compact utterance-level representation with small intra-speaker distances and large inter-speaker distances. We propose a novel model named the Voice Transformer (VOT) for speaker verification. The model consists of multiple parallel Transformers whose outputs are adaptively combined. A Deeply-Fused Semantic Memory Network (DFSMN) is integrated into the attention modules of these Transformers to capture long-distance information and enhance local dependencies. Statistical pooling layers are incorporated to improve overall performance without significantly increasing the number of parameters. We also propose a new loss function, the Additive Angular Margin Focal Loss (AAMF), to address the hard-sample mining issue. We evaluate the proposed approach on the VoxCeleb1 and CN-Celeb2 datasets. The experimental results demonstrate that VOT achieves state-of-the-art performance, outperforming nearly all existing models. The code is available on GitHub.
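Statistics pooling in speaker embedding networks conventionally concatenates the mean and standard deviation of the frame-level features, which adds no trainable parameters; the sketch below illustrates that generic operation (the module name and tensor layout are illustrative and not taken from the released code).

```python
import torch
import torch.nn as nn

class StatisticsPooling(nn.Module):
    """Parameter-free statistics pooling: concatenate the per-utterance mean
    and standard deviation of frame-level features."""

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feature_dim) frame-level representations
        mean = frames.mean(dim=1)
        std = frames.std(dim=1, unbiased=False).clamp(min=1e-5)
        return torch.cat([mean, std], dim=1)  # (batch, 2 * feature_dim)
```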
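The name Additive Angular Margin Focal Loss suggests combining an ArcFace-style additive angular margin softmax with focal-loss modulation that emphasizes hard samples. The following is a minimal sketch under that assumption; the class name and the default margin, scale, and gamma values are hypothetical and do not reflect the paper's actual formulation or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMFocalLoss(nn.Module):
    """Sketch of an angular-margin softmax combined with focal weighting."""

    def __init__(self, embedding_dim, num_speakers, margin=0.2, scale=30.0, gamma=2.0):
        super().__init__()
        # Class weight vectors, L2-normalized at use time.
        self.weight = nn.Parameter(torch.empty(num_speakers, embedding_dim))
        nn.init.xavier_normal_(self.weight)
        self.margin = margin   # additive angular margin m
        self.scale = scale     # logit scale s
        self.gamma = gamma     # focal focusing parameter

    def forward(self, embeddings, labels):
        # Cosine similarity between normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin only to the target-class angle.
        target_mask = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        logits = self.scale * torch.cos(torch.where(target_mask, theta + self.margin, theta))
        # Focal modulation: down-weight easy samples, focus on hard ones.
        log_prob = F.log_softmax(logits, dim=1)
        target_log_prob = log_prob.gather(1, labels.unsqueeze(1)).squeeze(1)
        loss = -((1.0 - target_log_prob.exp()) ** self.gamma) * target_log_prob
        return loss.mean()
```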