Open-set speaker recognition can be regarded as a metric learning problem: maximize inter-class variance while minimizing intra-class variance. Supervised metric learning can be categorized into entity-based learning and proxy-based learning. Most existing metric learning objectives, such as Contrastive, Triplet, Prototypical, and GE2E losses, belong to the former category; their performance is either highly dependent on the sample mining strategy or restricted by insufficient label information in the mini-batch. Proxy-based losses mitigate both shortcomings; however, they leverage fine-grained connections among entities only indirectly, if at all. This paper proposes a Masked Proxy (MP) loss that directly incorporates both proxy-based and pair-based relationships. We further propose a Multinomial Masked Proxy (MMP) loss to leverage the hardness of speaker pairs. Evaluated on the VoxCeleb test set, these methods reach state-of-the-art Equal Error Rate (EER).
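To make the idea concrete, here is a minimal illustrative sketch of combining proxy-based and pair-based relationships in one objective. This is not the paper's exact formulation: the function name, the scale parameter, and the specific masking rule (replacing the proxy of each class present in the mini-batch by that class's in-batch mean embedding, so same-batch pairs contribute directly) are assumptions for illustration.

```python
import numpy as np

def masked_proxy_loss(embeddings, labels, proxies, scale=10.0):
    """Illustrative sketch (assumed, not the paper's exact loss):
    a softmax proxy loss where, for classes present in the
    mini-batch, the stored class proxy is "masked" (replaced)
    by the mean of that class's in-batch embeddings, so
    pair-based relationships enter the loss directly.
    """
    # L2-normalise embeddings and proxies for cosine similarity
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centers = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    centers = centers.copy()

    # Mask step: in-batch classes use in-batch means instead of proxies
    for c in np.unique(labels):
        m = emb[labels == c].mean(axis=0)
        centers[c] = m / np.linalg.norm(m)

    # Scaled cosine logits against the (masked) proxy bank
    logits = scale * emb @ centers.T

    # Numerically stable cross-entropy over all classes
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Because the masked classes are scored against in-batch statistics while absent classes keep their learned proxies, a single softmax term mixes entity-level (pair) and proxy-level supervision, which is the intuition the abstract describes.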