The majority of recent state-of-the-art speaker verification architectures adopt multi-scale processing and frequency-channel attention mechanisms. Convolutional layers of these models typically have a fixed kernel size, e.g., 3 or 5. In this study, we further contribute to this line of research by utilising a selective kernel attention (SKA) mechanism. The SKA mechanism allows each convolutional layer to adaptively select its kernel size in a data-driven fashion. It is based on an attention mechanism that exploits both the frequency and channel domains. We first apply the existing SKA module to our baseline. We then propose two SKA variants: the first is applied in front of the ECAPA-TDNN model, while the other is combined with the Res2Net backbone block. Through extensive experiments, we demonstrate that the two proposed SKA variants consistently improve performance and are complementary when tested on three different evaluation protocols.
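For context, the sketch below illustrates the generic selective-kernel idea that SKA builds on: parallel convolutions with different kernel sizes are fused by a learned soft attention over the branches, so the effective kernel size is selected per input in a data-driven fashion. This is a minimal PyTorch sketch following the original SKNet formulation, not the frequency-channel SKA variants proposed here; the class name SelectiveKernelConv1d and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SelectiveKernelConv1d(nn.Module):
    """Generic selective-kernel block (after SKNet): parallel convolutions
    with different kernel sizes, fused by a learned soft attention over
    the branches. Illustrative sketch only."""

    def __init__(self, channels: int, kernel_sizes=(3, 5), reduction: int = 4):
        super().__init__()
        # One convolutional branch per candidate kernel size.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, k, padding=k // 2),
                nn.BatchNorm1d(channels),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        )
        hidden = max(channels // reduction, 8)
        self.reduce = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        # One attention head per branch; softmax across branches selects kernels.
        self.select = nn.Linear(hidden, channels * len(kernel_sizes))
        self.num_branches = len(kernel_sizes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, T)
        fused = feats.sum(dim=1)                                   # (B, C, T)
        pooled = fused.mean(dim=-1)                                # global average pool: (B, C)
        attn = self.select(self.reduce(pooled))                    # (B, K*C)
        attn = attn.view(-1, self.num_branches, feats.size(2), 1)  # (B, K, C, 1)
        attn = attn.softmax(dim=1)                                 # soft kernel selection per channel
        return (feats * attn).sum(dim=1)                           # (B, C, T)
```

As a usage check, `SelectiveKernelConv1d(512)(torch.randn(8, 512, 200))` returns a tensor of shape `(8, 512, 200)`; the softmax over the branch dimension is what makes the kernel choice adaptive rather than fixed.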