Speaker embeddings are vector representations extracted from a speech signal that are intended to capture the speaker identity alone. They are commonly used to classify and discriminate between different speakers. However, there is no objective measure of how well a speaker embedding disentangles the speaker identity from the other characteristics of speech. Consequently, the embeddings remain far from ideal: they are highly dependent on the training corpus and still carry residual information about factors such as the linguistic content, recording conditions or speaking style of the utterance. This paper presents an analysis of six sets of speaker embeddings extracted with some of the most recent and high-performing DNN architectures, focusing on the degree to which they truly disentangle the speaker identity from the speech signal. To evaluate the architectures correctly, a large multi-speaker parallel speech dataset is used. The dataset includes 46 speakers uttering the same set of prompts, recorded either in a professional studio or in their home environments. The analysis examines the intra- and inter-speaker similarity measures computed over the different embedding sets, as well as whether simple classification and regression methods can recover several residual information factors from the speaker embeddings. The results show that the discriminative power of the analyzed embeddings is very high, yet across all the analyzed architectures residual information is still present in the representations, in the form of a high correlation with the recording conditions, linguistic content and utterance duration.
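As a rough illustration of the intra- and inter-speaker similarity measures mentioned above, the following minimal sketch computes the mean cosine similarity between utterances of the same speaker and between utterances of different speakers. The choice of cosine similarity, the helper names (`cosine_similarity_matrix`, `intra_inter_similarity`) and the synthetic embeddings are assumptions made for illustration only; they are not taken from the paper, whose exact measures and data are not specified here.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T


def intra_inter_similarity(embeddings, speaker_ids):
    """Mean same-speaker and different-speaker cosine similarity.

    Hypothetical helper: the paper's actual similarity measure may differ.
    """
    sims = cosine_similarity_matrix(embeddings)
    speaker_ids = np.asarray(speaker_ids)
    same = speaker_ids[:, None] == speaker_ids[None, :]
    # Exclude the diagonal so an utterance is never compared with itself.
    off_diag = ~np.eye(len(speaker_ids), dtype=bool)
    intra = sims[same & off_diag].mean()
    inter = sims[~same].mean()
    return intra, inter


# Synthetic stand-in data (assumption): 4 speakers, 5 utterances each,
# 16-dimensional embeddings clustered around one centroid per speaker.
rng = np.random.default_rng(0)
n_speakers, n_utts, dim = 4, 5, 16
centroids = rng.normal(size=(n_speakers, dim))
speaker_ids = np.repeat(np.arange(n_speakers), n_utts)
noise = 0.1 * rng.normal(size=(n_speakers * n_utts, dim))
embeddings = centroids[speaker_ids] + noise

intra, inter = intra_inter_similarity(embeddings, speaker_ids)
print(f"intra-speaker similarity: {intra:.3f}")
print(f"inter-speaker similarity: {inter:.3f}")
```

A large gap between the two values suggests good speaker discrimination. It says nothing by itself about residual information, which would need a separate probe, such as a classifier trained to predict the recording condition from the embeddings.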