The challenges in applying contrastive learning to speaker verification (SV) are that the softmax-based contrastive loss lacks discriminative power and that hard negative pairs can disproportionately influence learning. To overcome these challenges, we propose a contrastive learning framework for SV that incorporates an additive angular margin into the supervised contrastive loss. The margin improves the discriminative ability of the speaker representations. We introduce a class-aware attention mechanism through which hard negative samples contribute less to the supervised contrastive loss. We also employ a gradient-based multi-objective optimization approach to balance the classification and contrastive losses. Experimental results on CN-Celeb and VoxCeleb1 show that the new learning objective guides the encoder toward an embedding space with strong speaker discrimination across languages.
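As a rough illustration of the first ingredient, the sketch below shows one way an additive angular margin can be folded into a supervised contrastive loss: the cosine similarity of same-speaker (positive) pairs is replaced by cos(θ + m) before the softmax normalization. The function name `aam_supcon_loss` and the default `margin`/`temperature` values are illustrative placeholders, not values taken from the paper, and the class-aware attention weighting is omitted here.

```python
import torch
import torch.nn.functional as F


def aam_supcon_loss(embeddings, labels, margin=0.2, temperature=0.07):
    """Supervised contrastive loss with an additive angular margin on positive pairs.

    embeddings: (N, D) speaker embeddings for one mini-batch
    labels:     (N,) integer speaker identities
    margin, temperature: illustrative defaults, not the paper's settings
    """
    z = F.normalize(embeddings, dim=1)                 # work in cosine-similarity space
    cos = (z @ z.t()).clamp(-1.0 + 1e-7, 1.0 - 1e-7)   # pairwise cos(theta)

    # Additive angular margin: use cos(theta + m) for same-speaker pairs only.
    theta = torch.acos(cos)
    cos_margin = torch.cos(theta + margin)

    labels = labels.view(-1, 1)
    eye = torch.eye(labels.size(0), device=labels.device)
    pos_mask = (labels == labels.t()).float() - eye    # positives, excluding self-pairs

    logits = torch.where(pos_mask.bool(), cos_margin, cos) / temperature

    # Softmax denominator over all other samples (the anchor itself is excluded).
    exp_logits = torch.exp(logits) * (1.0 - eye)
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))

    # Average the positive log-probabilities per anchor, then over the batch.
    pos_count = pos_mask.sum(dim=1).clamp(min=1.0)
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_count
    return loss.mean()
```

The margin penalizes positive pairs that are merely moderately close, pushing same-speaker embeddings into a tighter angular region and thereby increasing the separation from other speakers in the batch.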