Although speaker verification has achieved significant performance improvements with the development of deep neural networks, domain mismatch remains a challenging problem in this field. In this study, we propose a novel framework that disentangles speaker-related and domain-specific features and applies domain adaptation to the speaker-related feature space alone. Compared with performing domain adaptation directly on a feature space from which domain information has not been removed, this disentanglement efficiently boosts adaptation performance. Specifically, input speech from the source and target domains is first encoded into separate latent feature spaces. Adversarial domain adaptation is then conducted on the shared speaker-related feature space to encourage domain invariance. Furthermore, we minimize the mutual information between the speaker-related and domain-specific features of both domains to enforce the disentanglement. Experimental results on the VOiCES dataset demonstrate that the proposed framework effectively produces more speaker-discriminative and domain-invariant speaker representations, achieving a relative 20.3% reduction in EER compared to the original ResNet-based system.
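The adversarial adaptation step described above is commonly realized with a gradient-reversal layer (GRL): a domain classifier is trained to predict the domain from the shared latent feature, while the reversed gradient pushes the encoder toward domain-invariant representations. The following is a minimal numpy sketch of that mechanism only, with hand-derived gradients for a linear encoder and a logistic domain head; the class name `GRLModel`, the dimensions, and the reversal strength `lam` are illustrative assumptions, not the paper's actual architecture, and the mutual-information term is not shown.

```python
import numpy as np

# Minimal sketch (assumption: NOT the paper's exact architecture) of
# adversarial domain adaptation via a gradient-reversal layer (GRL).
# The encoder maps input features to a speaker-related latent z; the
# domain head tries to predict the domain from z, while the GRL flips
# (and scales) the gradient before it reaches the encoder, so the
# encoder is pushed toward domain-invariant features.

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRLModel:
    def __init__(self, in_dim, z_dim, lam=1.0, lr=0.1):
        self.W = rng.normal(scale=0.1, size=(z_dim, in_dim))  # linear encoder
        self.v = rng.normal(scale=0.1, size=z_dim)            # domain head
        self.lam = lam  # gradient-reversal strength (hyperparameter)
        self.lr = lr

    def step(self, x, domain_label):
        z = self.W @ x                    # encode one utterance's features
        p = sigmoid(self.v @ z)           # P(domain = target | z)
        g = p - domain_label              # dL/dlogit for binary cross-entropy
        grad_v = g * z                    # domain-head gradient (unchanged)
        grad_z = g * self.v               # gradient arriving at z
        grad_W = -self.lam * np.outer(grad_z, x)  # GRL: sign flipped here
        self.v -= self.lr * grad_v        # head learns to tell domains apart
        self.W -= self.lr * grad_W        # encoder learns to fool the head
        return z, p

model = GRLModel(in_dim=8, z_dim=4)
x = rng.normal(size=8)       # stand-in for a frame-level feature vector
z, p = model.step(x, domain_label=1.0)
```

In a full system this adversarial loss would be combined with a speaker-classification loss on z and the mutual-information penalty between the speaker-related and domain-specific branches.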