Teleconferencing has become essential during the COVID-19 pandemic. In real-world applications, however, speech quality can deteriorate because of background interference, noise, or reverberation. To address this problem, target speech extraction from mixture signals can be performed with the aid of the user's vocal features. The proposed system accounts for several such features, including speaker embeddings derived from user enrollment and a novel long-short-term spatial coherence (LSTSC) feature related to the target speaker's activity. As a learning-based approach, a target speech sifting network is employed to extract the target speech based on these features. The network trained with the LSTSC feature is robust to variations in microphone array geometry and the number of microphones. Furthermore, the proposed enhancement system was compared with a baseline system using speaker embeddings and the interchannel phase difference. The results demonstrate the superiority of the proposed system over the baseline in both enhancement performance and robustness.
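As a rough illustration of the kind of spatial feature the abstract refers to, the sketch below computes a magnitude-squared coherence between two microphone channels with long and short temporal averaging windows and contrasts them. This is only a minimal, hypothetical example of a long-short-term coherence contrast; the function names (`spatial_coherence`, `lstsc_like_feature`) and all parameter choices are illustrative assumptions, not the paper's exact LSTSC definition.

```python
import numpy as np
from scipy.signal import stft


def _smooth_time(S, n_frames):
    """Moving average along the time (frame) axis; works for complex spectra."""
    kernel = np.ones(n_frames) / n_frames
    out = np.empty_like(S)
    for k in range(S.shape[0]):
        out[k] = np.convolve(S[k], kernel, mode="same")
    return out


def spatial_coherence(x_i, x_j, fs=16000, n_fft=512, hop=256, n_frames_avg=8):
    """Magnitude-squared coherence between two microphone signals,
    estimated per time-frequency bin with a sliding average over frames.
    Values lie in [0, 1]; coherent (directional) sources score high."""
    _, _, X_i = stft(x_i, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, X_j = stft(x_j, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    cross = _smooth_time(X_i * np.conj(X_j), n_frames_avg)       # cross-spectrum
    p_i = _smooth_time(np.abs(X_i) ** 2, n_frames_avg)           # auto-spectrum, mic i
    p_j = _smooth_time(np.abs(X_j) ** 2, n_frames_avg)           # auto-spectrum, mic j
    return np.abs(cross) ** 2 / (p_i * p_j + 1e-10)


def lstsc_like_feature(x_i, x_j, **stft_kwargs):
    """Hypothetical long-short-term contrast: short-window minus long-window
    coherence, which reacts to a speaker becoming active in a given direction."""
    long_c = spatial_coherence(x_i, x_j, n_frames_avg=32, **stft_kwargs)
    short_c = spatial_coherence(x_i, x_j, n_frames_avg=4, **stft_kwargs)
    return short_c - long_c
```

Because such a coherence-based feature depends only on pairwise channel statistics rather than on explicit array geometry, it is consistent with the robustness to microphone placement and microphone count claimed in the abstract; the actual feature design and its use in the sifting network follow the paper itself.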