Attention mechanisms have recently been applied successfully in neural network-based speaker verification systems, and incorporating the Squeeze-and-Excitation (SE) block into convolutional neural networks has achieved remarkable performance. However, the SE block uses global average pooling (GAP), which simply averages the features along the time and frequency dimensions and therefore fails to preserve sufficient speaker information in the feature maps. In this study, we show mathematically that GAP is a special case of the discrete cosine transform (DCT) on the time-frequency domain: it uses only the lowest frequency component of the frequency decomposition. To strengthen the extraction of speaker information, we propose to utilize multi-frequency information and design two novel attention modules, the Single-Frequency Single-Channel (SFSC) attention module and the Multi-Frequency Single-Channel (MFSC) attention module. Built on the DCT, the proposed modules capture more speaker information from multiple frequency components. We conduct comprehensive experiments on the VoxCeleb datasets and a probe evaluation on the 1st 48-UTD forensic corpus. Experimental results demonstrate that the proposed SFSC and MFSC attention modules generate more discriminative speaker representations and outperform the ResNet34-SE and ECAPA-TDNN systems, with relative EER reductions of 20.9% and 20.2%, respectively, without adding extra network parameters.
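The claim that GAP is a special case of the DCT can be checked numerically. The sketch below is not the authors' implementation: it assumes an unnormalized two-dimensional DCT-II basis, and the function names `dct2_basis` and `dct2_component` as well as the feature-map shape are hypothetical choices made for illustration. Under these assumptions, the lowest frequency component (indices 0, 0) has a constant basis of ones, so it equals GAP multiplied by the number of time-frequency bins.

```python
import numpy as np


def dct2_basis(h, w, height, width):
    """Unnormalized 2D DCT-II basis for frequency indices (h, w).

    Assumption: the standard DCT-II cosine basis without scaling factors.
    Returns an array of shape (height, width).
    """
    i = np.arange(height)[:, None]  # positions along the frequency axis
    j = np.arange(width)[None, :]   # positions along the time axis
    return (np.cos(np.pi * h * (i + 0.5) / height)
            * np.cos(np.pi * w * (j + 0.5) / width))


def dct2_component(x, h, w):
    """Project each channel of x (channels, freq, time) onto one DCT basis."""
    basis = dct2_basis(h, w, x.shape[-2], x.shape[-1])
    return (x * basis).sum(axis=(-2, -1))


# Hypothetical feature map: 4 channels, 8 frequency bins, 10 time frames.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 8, 10))

# Global average pooling over the frequency and time axes.
gap = feature_map.mean(axis=(-2, -1))

# Lowest frequency DCT component, one value per channel.
lowest = dct2_component(feature_map, 0, 0)

# The two agree up to the constant factor height * width.
gap_matches_dct = np.allclose(lowest, gap * 8 * 10)

# A higher frequency component carries information that GAP discards.
higher = dct2_component(feature_map, 1, 2)
```

Choosing other index pairs in `dct2_component` gives the additional frequency components that a multi-frequency squeeze could draw on; how SFSC and MFSC select and combine them is not specified in the abstract.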