Verifying the identity of a speaker is crucial in modern human-machine interfaces, e.g., to ensure privacy protection or to enable biometric authentication. Classical speaker verification (SV) approaches estimate a fixed-dimensional embedding from a speech utterance that encodes the speaker's voice characteristics. A speaker is verified if their voice embedding is sufficiently similar to the embedding of the claimed speaker. However, such approaches assume that only a single speaker is present in the input, and concurrent speakers are likely to degrade performance. To address SV in a multi-speaker environment, we propose an end-to-end deep learning-based SV system that detects whether the target speaker is present in an input mixture. First, an embedding is estimated from a reference utterance to represent the target speaker's characteristics. Second, frame-level features are extracted from the input mixture. The reference embedding is then fused frame-wise with the mixture's features, allowing the target speaker to be distinguished from other speakers on a per-frame basis. Finally, the fused features are used to predict whether the target speaker is active in the speech segment. Experimental evaluation shows that the proposed method outperforms the x-vector baseline in multi-speaker conditions.
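To make the described pipeline concrete, the following is a minimal PyTorch-style sketch of the three stages (reference embedding, frame-level mixture features, frame-wise fusion followed by a segment-level decision). All module choices, layer sizes, the use of LSTMs, the concatenation-based fusion, and the mean pooling are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class TargetSpeakerVerifier(nn.Module):
    """Hypothetical sketch: detect whether a target speaker is active in a mixture."""

    def __init__(self, n_mels=40, emb_dim=128, hidden_dim=256):
        super().__init__()
        # Speaker encoder: maps a reference utterance to a fixed-dimensional embedding.
        self.ref_encoder = nn.LSTM(n_mels, emb_dim, batch_first=True)
        # Mixture encoder: produces frame-level features from the (multi-speaker) input.
        self.mix_encoder = nn.LSTM(n_mels, hidden_dim, batch_first=True)
        # Classifier applied after frame-wise fusion (concatenation is assumed here).
        self.frame_classifier = nn.Sequential(
            nn.Linear(hidden_dim + emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, ref_feats, mix_feats):
        # ref_feats: (batch, T_ref, n_mels); mix_feats: (batch, T_mix, n_mels)
        _, (h_ref, _) = self.ref_encoder(ref_feats)
        spk_emb = h_ref[-1]                                  # (batch, emb_dim)
        frame_feats, _ = self.mix_encoder(mix_feats)         # (batch, T_mix, hidden_dim)
        # Fuse the reference embedding with every frame of the mixture.
        spk_tiled = spk_emb.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
        fused = torch.cat([frame_feats, spk_tiled], dim=-1)
        frame_logits = self.frame_classifier(fused).squeeze(-1)  # (batch, T_mix)
        # Segment-level decision: pool the frame scores (mean pooling assumed).
        segment_logit = frame_logits.mean(dim=1)
        return segment_logit, frame_logits


# Usage example with random tensors standing in for log-mel spectrogram features.
model = TargetSpeakerVerifier()
ref = torch.randn(2, 300, 40)   # reference utterances of the claimed speaker
mix = torch.randn(2, 500, 40)   # possibly multi-speaker input mixtures
seg_logit, frame_logits = model(ref, mix)
prob_target_active = torch.sigmoid(seg_logit)  # probability the target is active
```

The sketch only illustrates the data flow; in practice the encoders, fusion mechanism, and pooling would be the specific networks trained end-to-end in the paper.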