Recently, the performance of blind speech separation (BSS) and target speech extraction (TSE) has progressed greatly. Most works, however, focus on relatively well-controlled conditions using, e.g., read speech, and performance may degrade in more realistic situations. One factor that may cause such degradation is intrinsic speaker variability, such as emotions, which commonly occurs in realistic speech. In this paper, we investigate the influence of emotions on TSE and BSS. We create a new test dataset of emotional mixtures for the evaluation of TSE and BSS. This dataset combines LibriSpeech and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Through controlled experiments, we analyze the impact of different emotions on the performance of BSS and TSE. We observe that BSS is relatively robust to emotions, while TSE, which requires identifying and extracting the speech of a target speaker, is much more sensitive to emotions. In comparative speaker verification experiments, we show that identifying the target speaker may be particularly challenging when dealing with emotional speech. Using our findings, we outline potential future directions that could improve the robustness of BSS and TSE systems to emotional speech.