This paper introduces a new AI-based Audio-Visual Speech Enhancement (AVSE) system and presents a comparative performance analysis of different deployment architectures. The proposed AVSE system employs convolutional neural networks (CNNs) for spectral feature extraction and long short-term memory (LSTM) networks for temporal modeling, enabling robust speech enhancement through multimodal fusion of audio and visual cues. Multiple deployment scenarios are investigated, including cloud-based, edge-assisted, and standalone device implementations, and each is evaluated in terms of speech quality improvement, latency, and computational overhead. Real-world experiments are conducted across various network conditions, including Ethernet, Wi-Fi, 4G, and 5G, to analyze the trade-offs between processing delay, communication latency, and perceptual speech quality. The results show that while cloud deployment achieves the highest enhancement quality, edge-assisted architectures offer the best balance between latency and intelligibility, meeting real-time requirements under 5G and Wi-Fi 6 conditions. These findings provide practical guidelines for selecting and optimizing AVSE deployment architectures in diverse applications, including assistive hearing devices, telepresence, and industrial communications.
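To make the described architecture concrete, the sketch below shows one plausible way to combine a CNN spectral encoder with an LSTM for temporal modeling and audio-visual fusion, written in PyTorch. Every layer size, variable name, and the mask-based enhancement strategy here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AVSENet(nn.Module):
    """Minimal sketch of a CNN + LSTM audio-visual speech enhancement model.

    All layer sizes and design choices are illustrative assumptions,
    not the configuration used in the paper.
    """

    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        # CNN encoder over noisy magnitude spectrogram frames (spectral features)
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.audio_proj = nn.Linear(32 * n_freq, hidden)
        # Project per-frame visual embeddings (e.g., lip-region features)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # LSTM models the fused audio-visual sequence over time
        self.lstm = nn.LSTM(2 * hidden, hidden, num_layers=2, batch_first=True)
        # Predict a time-frequency mask applied to the noisy spectrogram
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, visual_feats):
        # noisy_spec: (batch, time, n_freq); visual_feats: (batch, time, visual_dim)
        b, t, f = noisy_spec.shape
        a = self.audio_cnn(noisy_spec.unsqueeze(1))      # (b, 32, t, f)
        a = a.permute(0, 2, 1, 3).reshape(b, t, -1)      # (b, t, 32 * f)
        a = self.audio_proj(a)
        v = self.visual_proj(visual_feats)
        fused, _ = self.lstm(torch.cat([a, v], dim=-1))  # multimodal fusion
        mask = self.mask_head(fused)                     # (b, t, n_freq)
        return mask * noisy_spec                         # enhanced magnitude

if __name__ == "__main__":
    model = AVSENet()
    spec = torch.randn(2, 100, 257).abs()  # 100 frames of 257-bin spectra
    vis = torch.randn(2, 100, 512)         # matching per-frame visual embeddings
    print(model(spec, vis).shape)          # torch.Size([2, 100, 257])
```

Masking the noisy magnitude spectrogram is one common enhancement strategy; the paper's system could equally predict the clean spectrum directly, and the same CNN-LSTM split would apply.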
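The reported trade-off between processing delay and communication latency can be framed as a simple end-to-end budget, as the sketch below illustrates. The function name, the per-component delay figures, and the real-time threshold are all hypothetical placeholders, not measurements from the paper; they serve only to show where each deployment option accrues its delay.

```python
# Hypothetical end-to-end latency budget for the three deployment options.
# All component values and the real-time threshold are illustrative
# placeholders, not measurements reported in the paper.

def end_to_end_latency_ms(inference_ms: float, uplink_ms: float = 0.0,
                          downlink_ms: float = 0.0, capture_ms: float = 4.0) -> float:
    """Total delay = audio/video capture + network transfer + model inference."""
    return capture_ms + uplink_ms + inference_ms + downlink_ms

# Standalone: no network hops, but slower on-device inference.
standalone = end_to_end_latency_ms(inference_ms=25.0)
# Edge-assisted: short network hops, faster inference on a nearby server.
edge = end_to_end_latency_ms(inference_ms=8.0, uplink_ms=5.0, downlink_ms=5.0)
# Cloud: fastest inference, but a longer round trip to a remote data center.
cloud = end_to_end_latency_ms(inference_ms=5.0, uplink_ms=30.0, downlink_ms=30.0)

REALTIME_BUDGET_MS = 40.0  # assumed conversational-latency target
for name, total in [("standalone", standalone), ("edge", edge), ("cloud", cloud)]:
    status = "meets" if total <= REALTIME_BUDGET_MS else "exceeds"
    print(f"{name:10s}: {total:5.1f} ms ({status} budget)")
```

Under this framing, edge-assisted deployment wins whenever its short network round trip costs less than the inference speedup it buys over on-device processing, which is consistent with the abstract's finding under 5G and Wi-Fi 6 conditions.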