Many approaches exist for deriving a single speaker's identity information from speech by recognizing consistent characteristics of its acoustic parameters. However, determining identity information is challenging when a speech signal contains multiple concurrent speakers. In this paper, we propose a novel deep speaker representation strategy that can reliably extract multiple speaker identities from overlapped speech. We design a network that extracts, from a given mixture, a high-level embedding containing the identity information of each speaker. Unlike conventional approaches that require reference acoustic features for training, our proposed algorithm requires only the speaker identity labels of the overlapped speech segments. We demonstrate the effectiveness of our algorithm on a speaker verification task and in a speech separation system conditioned on target speaker embeddings obtained through the proposed method.
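As a rough illustration of the setup the abstract describes, the sketch below shows one plausible shape for such a network: a shared encoder over features of the mixed signal, one embedding head per concurrent speaker, and a speaker-identity classifier that supplies the only training supervision. This is a minimal sketch under stated assumptions, not the authors' implementation; all names (`MixtureEncoder`, `NUM_SPEAKERS`, `EMBED_DIM`, `MAX_SPEAKERS`) and the fixed two-speaker, fixed-slot setup are hypothetical choices for illustration.

```python
# Hypothetical sketch of a multi-speaker embedding extractor trained only
# with speaker identity labels of overlapped segments (no reference
# acoustic features). Not the paper's architecture.
import torch
import torch.nn as nn

NUM_SPEAKERS = 1000   # size of the training speaker inventory (assumed)
EMBED_DIM = 256       # embedding dimensionality (assumed)
MAX_SPEAKERS = 2      # max concurrent speakers per mixture (assumed)

class MixtureEncoder(nn.Module):
    def __init__(self, n_mels=80):
        super().__init__()
        # Shared frame-level encoder over log-mel features of the mixture.
        self.lstm = nn.LSTM(n_mels, 512, num_layers=2, batch_first=True)
        # One projection head per extracted embedding slot.
        self.heads = nn.ModuleList(
            nn.Linear(512, EMBED_DIM) for _ in range(MAX_SPEAKERS)
        )
        # Identity classifier used only during training; the speaker
        # labels of the overlapped segment are the sole supervision.
        self.classifier = nn.Linear(EMBED_DIM, NUM_SPEAKERS)

    def forward(self, mels):                     # mels: (B, T, n_mels)
        h, _ = self.lstm(mels)
        pooled = h.mean(dim=1)                   # temporal average pooling
        # (B, MAX_SPEAKERS, EMBED_DIM): one identity embedding per speaker.
        return torch.stack([head(pooled) for head in self.heads], dim=1)

def training_loss(model, mels, speaker_ids):
    """speaker_ids: (B, MAX_SPEAKERS) identity labels for each mixture.
    Uses a fixed label-to-slot assignment for brevity; a permutation-
    invariant assignment would be the more robust choice in practice."""
    embeds = model(mels)                         # (B, S, EMBED_DIM)
    logits = model.classifier(embeds)            # (B, S, NUM_SPEAKERS)
    return nn.functional.cross_entropy(
        logits.flatten(0, 1), speaker_ids.flatten()
    )
```

At inference, the classifier is discarded and the per-slot embeddings serve downstream tasks such as verification scoring or conditioning a target-speaker separation system.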