This manuscript proposes a novel, robust procedure for extracting a speaker of interest (SOI) from a mixture of audio sources. The SOI is estimated via independent vector extraction (IVE). Since blind IVE cannot distinguish the target source by itself, it is guided towards the SOI via frame-wise speaker identification based on deep learning. Still, an incorrect speaker can be extracted due to guidance failures, especially when processing challenging data. To identify such cases, we propose a criterion for non-intrusively assessing the estimated speaker. It utilizes the same model as the speaker identification, so no additional training is required. When incorrect extraction is detected, we propose a ``deflation'' step in which the incorrect source is subtracted from the mixture and, subsequently, another attempt to extract the SOI is performed. The process is repeated until the extraction is successful. The proposed procedure is experimentally tested on artificial and real-world datasets containing challenging phenomena such as source movements, reverberation, transient noise, and microphone failures. The method is compared with state-of-the-art blind algorithms as well as with current fully supervised deep learning-based methods.
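The abstract describes an iterative detect-and-deflate loop: extract a source, assess whether it is the SOI, and, on failure, subtract the wrong source from the mixture and retry. The following is a minimal conceptual sketch of that loop, assuming a multichannel mixture array of shape (channels, samples). The helpers `ive_extract`, `speaker_consistency`, and `deflate` are hypothetical toy stand-ins for the paper's guided IVE, its non-intrusive assessment criterion, and its deflation step, which are learned/statistical components not reproduced here.

```python
import numpy as np

def ive_extract(mixture, soi_embedding):
    """Toy stand-in for guided IVE: return the highest-variance channel."""
    return mixture[np.argmax(np.var(mixture, axis=1))]

def speaker_consistency(estimate, soi_embedding):
    """Toy stand-in for the non-intrusive criterion: cosine similarity
    between a (fake) embedding of the estimate and the SOI embedding.
    In the paper, the same identification model scores the estimate."""
    emb = estimate[: soi_embedding.size]  # stand-in for a learned embedding
    return float(np.dot(emb, soi_embedding) /
                 (np.linalg.norm(emb) * np.linalg.norm(soi_embedding) + 1e-12))

def deflate(mixture, wrong_source):
    """Toy deflation: least-squares subtraction of the wrongly extracted
    source from every channel of the mixture."""
    gains = mixture @ wrong_source / (wrong_source @ wrong_source + 1e-12)
    return mixture - np.outer(gains, wrong_source)

def extract_soi(mixture, soi_embedding, max_attempts=4, threshold=0.5):
    """Detect-and-deflate loop sketched from the abstract."""
    residual = mixture
    for _ in range(max_attempts):
        estimate = ive_extract(residual, soi_embedding)
        if speaker_consistency(estimate, soi_embedding) >= threshold:
            return estimate  # assessment accepts the extracted speaker
        residual = deflate(residual, estimate)  # remove wrong source, retry
    return None  # no acceptable extraction within the attempt budget
```

Note that, as in the abstract, the acceptance test reuses the same speaker-identification model that guides the extraction, so the loop requires no additional training; the attempt budget and acceptance threshold here are illustrative parameters, not values from the paper.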