Target speech extraction is a technique for extracting a target speaker's voice from a mixture signal using a pre-recorded enrollment utterance that characterizes the target speaker's voice. One major difficulty of target speech extraction lies in handling ``intra-speaker'' variability, i.e., the mismatch in characteristics between the target speech and the enrollment utterance. While most conventional approaches focus on improving {\it average performance} over a set of enrollment utterances, here we aim to guarantee the {\it worst performance}, which we believe is of great practical importance. In this work, we propose an evaluation metric called worst-enrollment source-to-distortion ratio (SDR) to quantitatively measure robustness to enrollment variations. We also introduce a novel training scheme that directly optimizes worst-case performance by focusing training on difficult enrollment cases where extraction performs poorly. In addition, we investigate an auxiliary speaker identification loss (SI-loss) as another way to improve robustness to enrollment variations. Experiments show that both worst-enrollment target training and SI-loss training improve robustness against enrollment variations by increasing speaker discriminability.
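As a rough illustration of the worst-enrollment idea, the sketch below scores one mixture against several enrollment utterances and reports the minimum SDR, together with the index of the hardest enrollment (the case a worst-enrollment training scheme would focus on). It rests on assumptions that the abstract does not spell out: SDR is taken here as the plain energy ratio $10\log_{10}(\|s\|^2 / \|s - \hat{s}\|^2)$ rather than any specific toolkit's definition, and the names `sdr_db`, `worst_enrollment_sdr`, `hardest_enrollment_index`, and `toy_extract` are hypothetical helpers, not part of the proposed system.

```python
import math
from typing import Callable, List, Sequence

Signal = Sequence[float]
# Assumed interface: an extractor maps (mixture, enrollment) to an estimate.
Extractor = Callable[[Signal, Signal], Signal]


def sdr_db(reference: Signal, estimate: Signal, eps: float = 1e-12) -> float:
    """Plain SDR in dB (assumed definition): 10*log10(||s||^2 / ||s - s_hat||^2)."""
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))


def per_enrollment_sdr(mixture: Signal, reference: Signal,
                       enrollments: Sequence[Signal],
                       extract: Extractor) -> List[float]:
    """SDR obtained with each enrollment utterance for the same mixture."""
    return [sdr_db(reference, extract(mixture, enr)) for enr in enrollments]


def worst_enrollment_sdr(mixture: Signal, reference: Signal,
                         enrollments: Sequence[Signal],
                         extract: Extractor) -> float:
    """Minimum SDR over the enrollment set (worst case, not the average)."""
    return min(per_enrollment_sdr(mixture, reference, enrollments, extract))


def hardest_enrollment_index(mixture: Signal, reference: Signal,
                             enrollments: Sequence[Signal],
                             extract: Extractor) -> int:
    """Index of the enrollment that yields the lowest SDR."""
    scores = per_enrollment_sdr(mixture, reference, enrollments, extract)
    return min(range(len(scores)), key=scores.__getitem__)


def toy_extract(mixture: Signal, enrollment: Signal) -> List[float]:
    """Hypothetical stand-in for a real extraction model.

    The first enrollment value is used as a constant error offset, so a
    'worse' enrollment simply produces a more distorted estimate.
    """
    return [x + enrollment[0] for x in mixture]
```

In a real evaluation, `toy_extract` would be replaced by the trained extraction network, and the per-mixture minimum would presumably be averaged over the test set; that aggregation step is an assumption as well.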