The recently proposed self-attentive pooling (SAP) has shown good performance in several speaker recognition systems. In SAP systems, the context vector is trained end-to-end together with the feature extractor, where the role of the context vector is to select the most discriminative frames for speaker recognition. However, SAP underperforms the temporal average pooling (TAP) baseline in some settings, which implies that the attention is not learnt effectively in end-to-end training. To tackle this problem, we introduce strategies for training the attention mechanism in a supervised manner, which learns the context vector using classified samples. With the proposed methods, the context vector is better trained to select the most informative frames. We show that our method outperforms existing methods in various experimental settings, including short-utterance speaker recognition, and achieves competitive performance against existing baselines on the VoxCeleb datasets.
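For reference, the pooling operation discussed above can be summarized in a few lines. The following is a minimal NumPy sketch of the standard SAP formulation, in which attention weights are a softmax over similarities between projected frame-level features and a learnable context vector; the names `H`, `W`, `b`, and `v` are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def self_attentive_pooling(H, W, b, v):
    """Self-attentive pooling over a sequence of frame-level features.

    H: (T, D) frame-level features from the feature extractor.
    W: (D, D) projection matrix, b: (D,) bias, v: (D,) context vector.
    Returns the (D,) utterance-level embedding.
    """
    # Per-frame hidden representation.
    U = np.tanh(H @ W + b)                 # (T, D)
    # Similarity of each frame to the learnable context vector.
    scores = U @ v                         # (T,)
    # Softmax over time yields the attention weights.
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Attention-weighted average of frames.
    # TAP is the special case alpha_t = 1/T for all t.
    return alpha @ H                       # (D,)
```

In end-to-end training, `W`, `b`, and `v` are learned jointly with the feature extractor from the speaker classification loss alone; the supervised strategies proposed here instead give the context vector its own training signal, so that the weights `alpha` concentrate on the most informative frames.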