Teleconferencing has become essential during the COVID-19 pandemic. However, in real-world applications, speech quality can deteriorate due to, for example, background interference, noise, or reverberation. To address this problem, target speech extraction from the mixture signals can be performed with the aid of the user's vocal features. The proposed system exploits several such features, including speaker embeddings derived from user enrollment and a novel long-short-term spatial coherence (LSTSC) feature that indicates the target speaker's activity. A learning-based target speech sifting network is employed to extract the target speech signal. Trained with the LSTSC feature, the proposed network is robust to variations in microphone array geometry and the number of microphones. Furthermore, the proposed enhancement system was compared with a baseline system that uses speaker embeddings and the interchannel phase difference (IPD) feature. The results demonstrate that the proposed system outperforms the baseline in both enhancement performance and robustness.
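The abstract names two multichannel features: the proposed long-short-term spatial coherence (LSTSC) and the baseline interchannel phase difference (IPD). The exact LSTSC formulation is given in the paper body, not here; the sketch below is only a minimal illustration of how such features could be computed for a two-microphone recording, assuming LSTSC contrasts coherence estimated over short and long temporal contexts. All function names (`stft_multichannel`, `spatial_coherence`, `lstsc_like_feature`, `ipd_feature`) and parameter values are hypothetical.

```python
import numpy as np

def stft_multichannel(x, n_fft=512, hop=256):
    """Naive multichannel STFT; x has shape (channels, samples)."""
    win = np.hanning(n_fft)
    frames = 1 + (x.shape[1] - n_fft) // hop
    X = np.empty((x.shape[0], frames, n_fft // 2 + 1), dtype=complex)
    for c in range(x.shape[0]):
        for t in range(frames):
            X[c, t] = np.fft.rfft(x[c, t * hop:t * hop + n_fft] * win)
    return X  # (channels, frames, freq_bins)

def spatial_coherence(X, i, j, context):
    """Magnitude-squared coherence between channels i and j,
    with spectra averaged over a sliding window of `context` frames."""
    cross = X[i] * np.conj(X[j])           # cross-power spectrum
    p_i, p_j = np.abs(X[i]) ** 2, np.abs(X[j]) ** 2
    kernel = np.ones(context) / context
    smooth = lambda a: np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), 0, a)
    num = np.abs(smooth(cross)) ** 2
    den = smooth(p_i) * smooth(p_j) + 1e-12
    return num / den                        # values in [0, 1]

def lstsc_like_feature(x, short_ctx=5, long_ctx=50):
    """Contrast of short-term vs. long-term coherence for one mic pair
    (assumed interpretation of the LSTSC idea): directional target speech
    keeps high short-term coherence, diffuse noise/reverberation does not."""
    X = stft_multichannel(x)
    return (spatial_coherence(X, 0, 1, short_ctx)
            - spatial_coherence(X, 0, 1, long_ctx))

def ipd_feature(x):
    """Baseline interchannel phase difference between mics 0 and 1."""
    X = stft_multichannel(x)
    return np.angle(X[0] * np.conj(X[1]))
```

Because the coherence contrast depends only on pairwise channel statistics rather than on absolute sensor positions, a feature of this kind is plausibly less sensitive to array geometry than raw phase differences, which is consistent with the robustness claim in the abstract.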