Hyperspherical embeddings have recently become a dominant technique for face and voice recognition. Specifically, vector embeddings in Euclidean space are trained to encode person-specific information in their direction, while their magnitude is ignored. However, recent studies have shown that the magnitudes of embeddings extracted by deep neural networks may indicate the quality of the corresponding inputs. This paper explores the properties of embedding magnitudes that relate to quality assessment and out-of-distribution detection. We propose a new probabilistic speaker embedding extractor that uses the information encoded in the embedding magnitude, and we leverage it in the speaker verification pipeline. We also propose several quality-aware diarization methods that incorporate the embedding magnitudes. Our results indicate significant improvements over magnitude-agnostic baselines in both speaker verification and diarization.
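To make the direction/magnitude distinction concrete, the following minimal sketch splits an embedding into a unit-length direction and a scalar magnitude, and shows that cosine scoring depends only on the direction. The function names `split_embedding` and `cosine_score` are hypothetical and chosen for illustration; this is not the extractor or scoring method proposed in the paper, only the magnitude-agnostic baseline idea under the assumption that embeddings are plain NumPy vectors.

```python
import numpy as np


def split_embedding(x, eps=1e-12):
    """Split a vector into (unit direction, magnitude).

    Hypothetical helper: eps guards against division by zero
    for an all-zero embedding.
    """
    x = np.asarray(x, dtype=float)
    magnitude = float(np.linalg.norm(x))
    direction = x / max(magnitude, eps)
    return direction, magnitude


def cosine_score(a, b):
    """Magnitude-agnostic similarity: dot product of the unit directions."""
    direction_a, _ = split_embedding(a)
    direction_b, _ = split_embedding(b)
    return float(np.dot(direction_a, direction_b))


if __name__ == "__main__":
    enroll = np.array([0.6, 0.8, 0.0])
    test = np.array([0.5, 0.7, 0.1])

    # Rescaling the test embedding leaves the cosine score unchanged,
    # so any information carried by the magnitude is discarded here.
    print(cosine_score(enroll, test))
    print(cosine_score(enroll, 10.0 * test))

    # The magnitude is available separately and could, as the abstract
    # suggests, be examined as a possible indicator of input quality.
    _, magnitude = split_embedding(10.0 * test)
    print(magnitude)
```

A quality-aware system would keep the magnitude returned by `split_embedding` alongside the score instead of dropping it; how the two are combined is specific to the proposed methods and is not reproduced here.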