Target speech extraction (TSE) extracts the speech of a target speaker from a mixture, given auxiliary clues that characterize the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of simultaneously performing separation and speaker identification. Extraction performance has improved considerably following the recent development of neural networks for speech enhancement and separation. Most studies have focused on processing mixtures in which the target speaker is actively speaking. In practice, however, the target speaker is sometimes silent, a condition referred to as the inactive speaker (IS) case. A typical TSE system tends to output a signal in IS cases, causing false alarms, which is a severe problem for the practical deployment of TSE systems. This paper aims to better understand how well TSE systems can handle IS cases. We consider two approaches to dealing with IS: (1) training a system to directly output zero signals, or (2) detecting IS with an extra speaker verification module. We perform an extensive experimental comparison of these schemes in terms of extraction performance and IS detection using the LibriMix dataset, and reveal their pros and cons.
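The two ways of handling inactive speaker (IS) cases named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the decision rules, and both thresholds (`threshold_db`, `threshold`) are assumptions chosen for illustration. For approach (1), the sketch assumes that a system trained to output zero signals can be checked by comparing output energy with mixture energy. For approach (2), it assumes that speaker embeddings of the extracted signal and of the enrollment utterance are already available from some speaker verification model, which is not shown here.

```python
import math


def relative_energy_db(output, mixture, eps=1e-12):
    """Energy of the extracted signal relative to the mixture, in dB."""
    out_energy = sum(x * x for x in output)
    mix_energy = sum(x * x for x in mixture)
    return 10.0 * math.log10((out_energy + eps) / (mix_energy + eps))


def is_inactive_by_energy(output, mixture, threshold_db=-30.0):
    """Approach (1), assumed decision rule: a system trained to emit zeros
    for IS inputs should produce a near-silent output, so a very low
    relative energy is read as an inactive target speaker.
    threshold_db is a hypothetical value, not taken from the paper."""
    return relative_energy_db(output, mixture) < threshold_db


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_inactive_by_verification(extracted_emb, enrollment_emb, threshold=0.5):
    """Approach (2), assumed decision rule: if the extracted signal does not
    match the enrolled speaker, treat the target as inactive.
    The embeddings and the threshold are hypothetical placeholders."""
    return cosine_similarity(extracted_emb, enrollment_emb) < threshold


if __name__ == "__main__":
    mixture = [0.5, -0.5, 0.5, -0.5]
    near_silent = [1e-4, -1e-4, 1e-4, -1e-4]
    print(is_inactive_by_energy(near_silent, mixture))
    print(is_inactive_by_verification([1.0, 0.0], [0.0, 1.0]))
```

In a real system both thresholds would need to be tuned on held-out data, since they trade false alarms in IS cases against wrongly suppressed output when the target speaker is active.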