This paper proposes a novel formulation of the prototypical loss with mixup for speaker verification. Mixup is a simple yet efficient data augmentation technique that fabricates weighted combinations of random pairs of data points and labels for deep neural network training. Mixup has attracted increasing attention due to its ability to improve the robustness and generalization of deep neural networks. Although mixup has shown success in diverse domains, most applications have centered on closed-set classification tasks. In this work, we propose contrastive-mixup, a novel augmentation strategy that learns distinguishing representations based on a distance metric. During training, mixup operations generate convex interpolations of both inputs and virtual labels. Moreover, we reformulate the prototypical loss function so that mixup can be applied to metric learning objectives. To demonstrate its generalization under limited training data, we conduct experiments by varying the number of available utterances per speaker in the VoxCeleb database. Experimental results show that contrastive-mixup outperforms the existing baseline, reducing the error rate by 16% relative, particularly when the number of training utterances per speaker is limited.
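To make the abstract's description concrete, the sketch below illustrates the two ingredients it names: a mixup operation that forms convex interpolations of inputs and virtual labels, and a prototypical-style loss in which the mixed label is handled as a convex combination of the two per-speaker losses. This is a minimal illustrative sketch in PyTorch under assumed conventions; the function names `mixup_batch` and `mixup_prototypical_loss`, the Beta parameter `alpha`, and the use of squared Euclidean distance to prototypes are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, labels, alpha=0.2):
    """Convex interpolation of inputs and (virtual) labels, mixup-style.
    x: (batch, feat_dim) input features; labels: (batch,) integer speaker ids.
    Returns mixed inputs, the two original label sets, and the mixing weight lambda.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, labels, labels[perm], lam

def mixup_prototypical_loss(embeddings, prototypes, labels_a, labels_b, lam):
    """Prototypical-style loss on mixed embeddings: softmax over negative
    squared distances to speaker prototypes, with the virtual label realized
    as a lambda-weighted combination of the two speakers' losses."""
    dists = torch.cdist(embeddings, prototypes) ** 2   # (batch, n_speakers)
    log_probs = torch.log_softmax(-dists, dim=1)
    loss_a = F.nll_loss(log_probs, labels_a)
    loss_b = F.nll_loss(log_probs, labels_b)
    return lam * loss_a + (1.0 - lam) * loss_b
```

In this sketch, `embeddings` would come from the speaker encoder applied to the mixed inputs and `prototypes` from per-speaker support examples, following the usual prototypical-network recipe; the exact reformulation used in the paper may differ.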