Deep convolutional neural networks (CNNs) have been applied to extracting speaker embeddings with significant success in speaker verification. Incorporating an attention mechanism has been shown to be effective in improving model performance. This paper presents an efficient two-dimensional convolution-based attention module, named C2D-Att. The interaction between convolution channels and frequencies is involved in the attention calculation through lightweight convolution layers, which require only a small number of parameters. Fine-grained attention weights are produced to represent channel- and frequency-specific information, and are imposed on the input features to improve their representation ability for speaker modeling. C2D-Att is integrated into a modified version of ResNet for speaker embedding extraction. Experiments are conducted on the VoxCeleb datasets. The results show that C2D-Att is effective in generating discriminative attention maps and outperforms other attention methods. The proposed model shows robust performance at different model sizes and achieves state-of-the-art results.
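The abstract describes attention weights computed jointly over the channel and frequency axes by a lightweight 2D convolution and then multiplied onto the input features. The following is a minimal NumPy sketch of that idea, not the paper's exact architecture: it assumes a feature map of shape (channel, frequency, time), time-averaged statistics as the attention input, a single hand-specified kernel in place of learned convolution layers, and a sigmoid to produce per-(channel, frequency) weights.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel 2D cross-correlation ('conv' in the
    deep-learning convention) with zero padding, 'same' output size."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def c2d_att_sketch(feat, kernel):
    """Hypothetical C2D-Att-style reweighting.
    feat: (C, F, T) feature map; kernel: small 2D kernel standing in
    for the paper's learned lightweight convolution layers."""
    stat = feat.mean(axis=2)           # (C, F) time-averaged statistics
    att = conv2d_same(stat, kernel)    # conv over the channel-frequency plane
    att = 1.0 / (1.0 + np.exp(-att))   # sigmoid -> weights in (0, 1)
    return feat * att[:, :, None]      # channel- and frequency-specific weights
```

Because the convolution slides over the channel-frequency plane, each attention weight depends on a local neighborhood of both channels and frequency bins, which is the channel-frequency interaction the abstract refers to, at the cost of only a kernel-sized set of parameters.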