As a practical alternative to speech separation, target speaker extraction (TSE) aims to extract the speech of a desired speaker using an additional cue derived from that speaker. Its main challenge lies in how to extract and leverage the speaker cue so that it improves the quality of the extracted speech. The cue extraction method adopted in the majority of existing TSE studies is to directly use a discriminative speaker embedding extracted from a model pre-trained for speaker verification. Although high speaker discriminability is a highly desirable property for the speaker verification task, we argue that it may be too sophisticated for TSE. In this study, we propose that a simplified speaker cue with clear class separability may be preferable for TSE. To verify our proposal, we introduce several forms of speaker cues, including naive speaker embeddings (such as x-vector and xi-vector) and new speaker embeddings produced by a sparse LDA transform. Corresponding TSE models are built by integrating these speaker cues with SepFormer, a state-of-the-art (SOTA) speech separation model. The performance of these TSE models is examined on the benchmark WSJ0-2mix dataset. Experimental results validate the effectiveness and generalizability of our proposal, showing up to 9.9% relative improvement in SI-SDRi. Moreover, with an SI-SDRi of 19.4 dB and a PESQ of 3.78, our best TSE system significantly outperforms the current SOTA systems and offers the best TSE results reported to date on WSJ0-2mix.
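For readers unfamiliar with the headline metric, the following is a minimal, dependency-free sketch of how SI-SDR and its improvement over the unprocessed mixture (SI-SDRi) are commonly computed. It is an illustration under stated assumptions, not the evaluation code used in this work: the function names `si_sdr` and `si_sdr_improvement`, the `eps` stabilizer, and the toy sinusoidal signals are all hypothetical choices made for this example.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Assumption: both signals are mean-normalized first, as is commonly done.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Whatever is left after removing the target counts as distortion.
    target_energy = sum(t * t for t in target)
    noise_energy = sum((e - t) ** 2 for e, t in zip(est, target))
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the extracted signal over the raw mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy example (hypothetical signals): a target tone plus an interfering tone.
reference = [math.sin(0.1 * t) for t in range(400)]
interferer = [math.sin(0.37 * t + 1.0) for t in range(400)]
mixture = [r + i for r, i in zip(reference, interferer)]
# Pretend an extraction system attenuated the interferer by a factor of 10.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

improvement = si_sdr_improvement(estimate, mixture, reference)
print(f"SI-SDRi: {improvement:.2f} dB")
```

Because the target is obtained by projection, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant. Under this definition, a relative improvement in SI-SDRi compares the dB gains of two systems measured against the same mixtures.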