Neural network-based speaker recognition has achieved significant improvements in recent years. A robust speaker representation learns meaningful knowledge from both hard and easy samples in the training set to achieve good performance. However, noisy samples (i.e., samples with wrong labels) in the training set introduce confusion and cause the network to learn incorrect representations. In this paper, we propose a two-step audio-visual deep cleansing framework to eliminate the effect of noisy labels in speaker representation learning. The framework consists of a coarse-grained cleansing step that searches for peculiar samples, followed by a fine-grained cleansing step that filters out the noisy labels. Our study starts from an efficient audio-visual speaker recognition system, which achieves near-perfect equal-error-rates (EER) of 0.01\%, 0.07\%, and 0.13\% on the Vox-O, Vox-E, and Vox-H test sets, respectively. With the proposed multi-modal cleansing mechanism, four different speaker recognition networks achieve an average improvement of 5.9\%. Code has been made available at: \textcolor{magenta}{\url{https://github.com/TaoRuijie/AVCleanse}}.
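To make the two-step idea concrete, the following is a minimal sketch, not the authors' released pipeline: we assume per-sample audio and visual embeddings are available, score each sample against its labelled speaker's centroid in both modalities (coarse step), and then discard flagged samples whose fused similarity falls below a hard threshold (fine step). All function names, the centroid-based scoring, and the \texttt{keep\_ratio} and \texttt{reject\_thr} parameters are illustrative assumptions.

\begin{verbatim}
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return float(np.dot(a, b) /
                 (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def coarse_cleanse(audio_emb, visual_emb, labels, keep_ratio=0.9):
    # Step 1 (coarse, hypothetical scoring): flag "peculiar" samples
    # whose audio and visual embeddings both sit far from their
    # labelled speaker's class centroid.
    labels = np.asarray(labels)
    scores = np.zeros(len(labels))
    for spk in np.unique(labels):
        idx = np.where(labels == spk)[0]
        a_cent = audio_emb[idx].mean(axis=0)
        v_cent = visual_emb[idx].mean(axis=0)
        for i in idx:
            # Fuse the two modalities by averaging the similarities.
            scores[i] = 0.5 * (cosine(audio_emb[i], a_cent)
                               + cosine(visual_emb[i], v_cent))
    thr = np.quantile(scores, 1.0 - keep_ratio)
    peculiar = scores < thr  # low-similarity samples get a second look
    return peculiar, scores

def fine_cleanse(peculiar, scores, reject_thr=0.3):
    # Step 2 (fine, hypothetical threshold): among the flagged samples,
    # treat those below a hard similarity threshold as noisy labels.
    noisy = peculiar & (scores < reject_thr)
    return ~noisy  # boolean mask of samples to keep for training

# Toy usage: 100 samples, 10 speakers, 192-dim random embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 192))
visual = rng.normal(size=(100, 192))
labels = rng.integers(0, 10, size=100)
peculiar, scores = coarse_cleanse(audio, visual, labels)
keep = fine_cleanse(peculiar, scores)
print(f"kept {keep.sum()} / {len(keep)} samples")
\end{verbatim}

Restricting the hard threshold to the coarse-flagged subset keeps the sketch conservative: samples that already agree with their speaker centroid in both modalities are never discarded.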