The human brain contextually exploits heterogeneous sensory information to efficiently perform cognitive tasks including vision and hearing. For example, in a cocktail party situation, the human auditory cortex contextually integrates audio-visual (AV) cues to better perceive speech. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in very low signal-to-noise ratio (SNR) environments compared with audio-only SE models. However, despite significant research in the area of AV SE, the development of real-time processing models with low latency remains a formidable technical challenge. In this paper, we present a novel framework for low-latency speaker-independent AV SE that can generalise across a range of visual and acoustic noises. In particular, a generative adversarial network (GAN) is proposed to address the practical issue of visual imperfections in AV SE. In addition, we propose a deep neural network (DNN) based real-time AV SE model that takes into account the cleaned visual speech output from the GAN to deliver more robust SE. The proposed framework is evaluated on synthetic and real noisy AV corpora using objective speech quality and intelligibility metrics and subjective listening tests. Comparative simulation results show that our real-time AV SE framework outperforms state-of-the-art SE approaches, including recent DNN-based SE models.
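To make the two-stage design concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of the pipeline described above: a GAN generator first cleans imperfect lip-region frames, and a DNN-based SE model then fuses the cleaned visual stream with the noisy audio features to estimate a time-frequency mask. All module names, feature dimensions, and the PyTorch framing are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class VisualCleanupGenerator(nn.Module):
    """Stand-in for the GAN generator: maps imperfect (occluded/noisy)
    lip-region frames to cleaned visual frames. Hypothetical architecture."""
    def __init__(self, frame_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim, 512), nn.ReLU(),
            nn.Linear(512, frame_dim), nn.Tanh(),
        )

    def forward(self, frames):               # frames: (batch, T, frame_dim)
        return self.net(frames)

class AVMaskEstimator(nn.Module):
    """Stand-in for the DNN-based SE model: fuses audio features with the
    cleaned visual features and predicts a mask for the noisy spectrogram."""
    def __init__(self, audio_dim=257, visual_dim=1024, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim + visual_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, audio_dim), nn.Sigmoid())

    def forward(self, audio_feats, visual_feats):
        fused = torch.cat([audio_feats, visual_feats], dim=-1)
        out, _ = self.rnn(fused)
        return self.mask(out)                 # mask in [0, 1] per T-F bin

# Toy usage: 2 utterances, 50 frames each (dimensions are illustrative).
gen, se = VisualCleanupGenerator(), AVMaskEstimator()
noisy_spec = torch.rand(2, 50, 257)           # |STFT| magnitudes
lip_frames = torch.rand(2, 50, 1024)          # flattened lip-region crops
enhanced = se(noisy_spec, gen(lip_frames)) * noisy_spec
```

In this sketch the mask is applied multiplicatively to the noisy magnitude spectrogram, a common choice for real-time mask-based SE; the actual fusion strategy, losses, and GAN training procedure are described in the paper itself.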