Capturing long-range dependencies and modeling long temporal contexts have been shown to benefit speaker verification. In this paper, we propose combining the Hierarchical-Split block (HS-block) with the Depthwise Separable Self-Attention (DSSA) module to capture richer multi-range speaker context features from a local and a global perspective, respectively. Specifically, the HS-block splits the feature map and filters into several groups and stacks them in one block, which enlarges the receptive fields (RFs) locally. The DSSA module improves the multi-head self-attention mechanism with a depthwise-separable strategy and an explicit sparse attention strategy, modeling pairwise relations globally and capturing effective long-range dependencies in each channel. Experiments are conducted on VoxCeleb and SITW. Our best system, which combines the HS-block and the DSSA module, achieves 1.27% EER on the VoxCeleb1 test set and 1.56% EER on SITW.
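As an illustration of the explicit sparse attention strategy mentioned above, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes the common top-k formulation: for each query, only the k largest attention scores are kept and the rest are masked out before the softmax. The function names `sparse_attention_weights` and `attend`, the parameter `k`, and the scaled dot-product scoring are assumptions made for the example; the depthwise-separable part of DSSA is not modeled here.

```python
import math


def sparse_attention_weights(scores, k):
    """Row-wise softmax restricted to the k largest scores of each row.

    Assumption: top-k selection per query. Ties at the threshold are all
    kept, so a row may retain more than k entries when scores are equal.
    """
    weights = []
    for row in scores:
        # The k-th largest score of this query acts as the cut-off.
        threshold = sorted(row, reverse=True)[k - 1]
        # Scores below the cut-off become -inf, so exp() maps them to 0.
        masked = [s if s >= threshold else -math.inf for s in row]
        peak = max(masked)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def attend(queries, keys, values, k):
    """Scaled dot-product attention with top-k sparsified weights."""
    dim = len(queries[0])
    scores = [
        [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim) for key in keys]
        for q in queries
    ]
    weights = sparse_attention_weights(scores, k)
    out_dim = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(out_dim)]
        for row in weights
    ]
```

With k equal to the sequence length, the sketch reduces to ordinary dense softmax attention; smaller k concentrates each query on its most relevant positions.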