Deep speaker embeddings have become the leading method for encoding speaker identity in speaker recognition tasks. The embedding space should ideally capture the variation between all possible speakers, encoding the multiple acoustic aspects that make up a speaker's identity, whilst being robust to non-speaker acoustic variation. Deep speaker embeddings are normally trained discriminatively, predicting speaker identity labels on the training data. We hypothesise that additionally predicting speaker-related auxiliary variables -- such as age and nationality -- may yield representations that generalise better to unseen speakers. We propose a framework for making use of auxiliary label information, even when it is only available for speech corpora mismatched to the target application. On a test set of US Supreme Court recordings, we show that by leveraging two additional forms of speaker attribute information, derived respectively from the matched training data and from the VoxCeleb corpus, we improve the performance of our deep speaker embeddings on both verification and diarization tasks, achieving a relative improvement of 26.2% in DER and 6.7% in EER compared to baselines using speaker labels only. This improvement is obtained despite the auxiliary labels having been scraped from the web and being potentially noisy.
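The core idea above -- training an embedding discriminatively on speaker identity while also predicting auxiliary attributes, even when those attributes are only labelled on some corpora -- can be sketched as a weighted multi-task loss. The sketch below is illustrative only and not the paper's implementation: the head names (`speaker`, `age`, `nationality`), the linear classification heads, and the `aux_weight` hyperparameter are assumptions for the example. Auxiliary terms are simply skipped for examples whose corpus lacks that label, which is how mismatched corpora can still contribute.

```python
import numpy as np

def softmax_xent(logits, label):
    """Numerically stable softmax cross-entropy for a single example."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def multitask_loss(embedding, heads, labels, aux_weight=0.1):
    """Speaker-identity loss plus down-weighted auxiliary-attribute losses.

    embedding : shared speaker embedding vector, shape (dim,)
    heads     : dict mapping task name -> (num_classes, dim) linear head
                (hypothetical; 'speaker' is mandatory, others optional)
    labels    : dict mapping task name -> integer class label; auxiliary
                entries may be absent when the corpus has no such labels
    """
    loss = softmax_xent(heads["speaker"] @ embedding, labels["speaker"])
    for task, W in heads.items():
        if task == "speaker":
            continue
        if task in labels:  # skip aux tasks unlabelled for this example
            loss += aux_weight * softmax_xent(W @ embedding, labels[task])
    return loss
```

In this formulation, a small `aux_weight` keeps the auxiliary tasks from dominating the primary speaker-identity objective while still shaping the shared embedding space.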