Many approaches can derive information about a single speaker's identity from speech by learning to recognize consistent characteristics of its acoustic parameters. However, determining identity information is challenging when a signal contains multiple concurrent speakers. In this paper, we propose a novel deep speaker representation strategy that can reliably extract multiple speaker identities from overlapped speech. We design a network that extracts, from a given mixture, a high-level embedding containing information about each speaker's identity. Unlike conventional approaches that require reference acoustic features for training, our proposed algorithm needs only the speaker identity labels of the overlapped speech segments. We demonstrate the effectiveness and usefulness of our algorithm on a speaker verification task and in a speech separation system conditioned on target speaker embeddings obtained through the proposed method.
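The label-only supervision described above can be sketched minimally as follows. All names, shapes, and the architecture here are illustrative assumptions, not the paper's actual model: a toy encoder maps an overlapped-speech feature vector to a fixed number of speaker-slot embeddings, and training would use only a multi-hot speaker-identity vector through a sigmoid classification head, with no clean reference acoustic features.

```python
# Hedged sketch: shapes, weights, and pooling are assumptions for
# illustration, not the paper's architecture.
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM, EMB_DIM, MAX_SPK, N_SPEAKERS = 40, 16, 2, 10  # assumed sizes

# Randomly initialized weights stand in for a trained encoder/classifier.
W_enc = rng.standard_normal((FEAT_DIM, MAX_SPK * EMB_DIM)) * 0.1
W_cls = rng.standard_normal((EMB_DIM, N_SPEAKERS)) * 0.1

def extract_embeddings(mixture_feat):
    """Encode one mixture feature vector into MAX_SPK speaker embeddings."""
    h = np.tanh(mixture_feat @ W_enc)        # (MAX_SPK * EMB_DIM,)
    return h.reshape(MAX_SPK, EMB_DIM)       # one embedding per speaker slot

def multilabel_loss(embeddings, labels):
    """Binary cross-entropy against a multi-hot speaker-identity vector.

    `labels` only marks which of N_SPEAKERS are present in the mixture,
    mirroring the label-only supervision described in the abstract.
    """
    logits = embeddings @ W_cls                        # (MAX_SPK, N_SPEAKERS)
    probs = 1.0 / (1.0 + np.exp(-logits.max(axis=0)))  # pool over slots
    eps = 1e-9
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))

mixture = rng.standard_normal(FEAT_DIM)            # stand-in mixture feature
labels = np.zeros(N_SPEAKERS)
labels[[2, 7]] = 1.0                               # two speakers present
embs = extract_embeddings(mixture)
loss = multilabel_loss(embs, labels)
print(embs.shape)                                  # (2, 16)
```

In a real system the encoder would be a deep network over time-frequency features and the resulting embeddings would condition a downstream separator, as the abstract describes; here the loss is only evaluated, not optimized.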