Speaker verification has been studied mostly under the single-talker condition, and its performance degrades in the presence of interfering speakers. Inspired by studies on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that pays selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multi-task learning. We study four different target speaker embedding schemes under the tSV framework. The experimental results show that all four schemes significantly outperform other competitive solutions for multi-talker speech. Notably, the best tSV speaker embedding scheme achieves 76.0% and 55.3% relative improvements in equal error rate (EER) over the baseline system on the WSJ0-2mix-extr and Libri2Mix corpora, respectively, for 2-talker speech, while its performance on single-talker speech is on par with that of a traditional speaker verification system trained and evaluated under the same single-talker condition.
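For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate and a relative improvement over a baseline are commonly computed. It is not the authors' evaluation code: the function names `equal_error_rate` and `relative_improvement` are hypothetical, the threshold sweep is one simple approximation among several, and the numbers in the usage example are illustrative only and are not results from this work.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    target_scores: similarity scores of trials where the speaker matches.
    nontarget_scores: similarity scores of trials where the speaker differs.
    Returns the EER as a fraction in [0, 1].
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.unique(np.concatenate([target, nontarget]))

    # False acceptance rate: non-target trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


def relative_improvement(baseline_eer, system_eer):
    """Relative EER reduction over the baseline, in percent."""
    return (baseline_eer - system_eer) / baseline_eer * 100.0


if __name__ == "__main__":
    # Toy scores for illustration only (not data from the paper).
    eer = equal_error_rate([0.9, 0.8, 0.4], [0.1, 0.2, 0.5])
    print(f"toy EER: {eer:.3f}")
    # Hypothetical EER values in percent, chosen only to show the arithmetic.
    print(f"relative improvement: {relative_improvement(8.0, 2.0):.1f}%")
```

Under this definition, a relative improvement of 76.0% means that the system's EER is 24.0% of the baseline's EER on the same trials.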