Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification

Automatic speaker verification has achieved great success using deep learning approaches trained on large-scale, manually annotated datasets. However, collecting a large amount of well-labeled data for system building is difficult and expensive. In this paper, we propose a novel self-supervised learning framework that can construct a high-performance speaker verification system without using any labeled data. To avoid the impact of false negative pairs, we adopt the self-distillation with no labels (DINO) framework as the initial model, which can be trained without exploiting negative pairs. We then introduce a cluster-aware training strategy for DINO to improve the diversity of data. In the iterative learning stage, clustering produces a large number of unreliable labels, so the quality of the pseudo labels is critical for system training. This motivates us to propose dynamic loss-gate and label correction (DLG-LC) methods to alleviate the performance degradation caused by unreliable labels. More specifically, we model the loss distribution with a Gaussian mixture model (GMM) and obtain the loss-gate threshold dynamically to distinguish reliable labels from unreliable ones. In addition, we use the model predictions to correct the unreliable labels, so that the unreliable data is put to better use rather than dropped directly. Moreover, we extend DLG-LC to multi-modality to further improve performance. The experiments are performed on the commonly used VoxCeleb dataset. Compared to the best-known self-supervised speaker verification system, our proposed method obtains 22.17%, 27.94% and 25.56% relative EER improvement on the Vox-O, Vox-E and Vox-H test sets, even with fewer iterations, smaller models, and simpler clustering methods. More importantly, the newly proposed system achieves results comparable to the fully supervised system without using any human-labeled data.
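The dynamic loss-gate idea described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It assumes that per-sample losses are fitted with a two-component 1-D GMM, that the threshold is the point between the two component means where the weighted densities are equal, and that an unreliable label is replaced by the model's top prediction only when that prediction is confident. The function names (`fit_two_gaussians`, `loss_gate_threshold`, `gate_and_correct`) and the `min_conf` parameter are hypothetical, and the losses are synthetic.

```python
import math
import random


def gauss(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(losses, iters=50):
    """Fit a two-component 1-D GMM to per-sample losses with plain EM."""
    xs = sorted(losses)
    half = len(xs) // 2
    # Initialise one component on the lower half and one on the upper half.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each loss value.
        resp = []
        for x in losses:
            p = [w[k] * gauss(x, mu[k], var[k]) for k in range(2)]
            s = sum(p) or 1e-300
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            mu[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            sq = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, losses))
            var[k] = max(sq / nk, 1e-6)
            w[k] = nk / len(losses)
    return w, mu, var


def loss_gate_threshold(w, mu, var):
    """Assumed rule: the loss where both weighted components are equally likely."""
    lo, hi = (0, 1) if mu[0] <= mu[1] else (1, 0)

    def diff(x):
        return w[lo] * gauss(x, mu[lo], var[lo]) - w[hi] * gauss(x, mu[hi], var[hi])

    a, b = mu[lo], mu[hi]
    if diff(a) <= 0 or diff(b) >= 0:
        return 0.5 * (a + b)  # no sign change between the means: use the midpoint
    for _ in range(60):  # bisection on the sign change
        m = 0.5 * (a + b)
        if diff(m) > 0:
            a = m
        else:
            b = m
    return 0.5 * (a + b)


def gate_and_correct(losses, labels, probs, threshold, min_conf=0.9):
    """Keep low-loss labels; otherwise use a confident prediction, else None."""
    out = []
    for loss, label, p in zip(losses, labels, probs):
        if loss <= threshold:
            out.append(label)  # treated as reliable
        else:
            best = max(range(len(p)), key=p.__getitem__)
            out.append(best if p[best] >= min_conf else None)
    return out


if __name__ == "__main__":
    random.seed(0)
    # Synthetic losses: a low-loss "reliable" mode and a high-loss "unreliable" mode.
    losses = [random.gauss(1.0, 0.3) for _ in range(800)]
    losses += [random.gauss(5.0, 0.8) for _ in range(200)]
    w, mu, var = fit_two_gaussians(losses)
    print("threshold:", loss_gate_threshold(w, mu, var))
```

Because the threshold is re-estimated from the current loss distribution, it can move as training progresses instead of being a fixed, hand-tuned value. A real system would refit it on the losses observed in each training round.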
