Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques rely on audio information alone. In this work, inspired by multimodal learning, which utilizes data from different modalities, and by the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model that incorporates audio and visual streams into a unified network. We also propose a multi-task learning framework for reconstructing audio and visual signals at the output layer. Specifically, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network in which audio and visual data are first processed by individual CNNs and then fused in a joint network to generate enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained end to end, and its parameters are jointly learned through back-propagation. We evaluate the enhanced speech using five instrumental criteria. Results show that the AVDCNN model performs notably better than an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. The AVDCNN model also outperforms an existing audio-visual SE model, confirming its ability to combine audio and visual information effectively in SE.
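
The data flow described in the abstract (two modality-specific CNN streams, a fused joint network, and two output heads trained with a multi-task objective) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the input shapes, layer sizes, single convolution per stream, the names `conv2d_valid`, `init_params`, `forward`, and `multitask_loss`, and the `visual_weight` factor on the secondary task are all assumptions chosen for illustration. Only the forward pass and the loss are shown; the back-propagation step is omitted.

```python
import numpy as np

# Hypothetical input sizes (not taken from the paper).
AUDIO_SHAPE = (16, 5)   # 16 frequency bins x 5 context frames of noisy speech
IMAGE_SHAPE = (8, 8)    # 8 x 8 grayscale crop of the mouth region
N_KERNELS, KSIZE, HIDDEN = 4, 3, 32


def conv2d_valid(x, kernels):
    """Naive 'valid' 2-D cross-correlation followed by ReLU.

    x: (H, W) single-channel input; kernels: (K, kh, kw).
    Returns an array of shape (K, H - kh + 1, W - kw + 1).
    """
    k, kh, kw = kernels.shape
    h, w = x.shape
    out = np.empty((k, h - kh + 1, w - kw + 1))
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = x[i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)


def init_params(rng):
    """Randomly initialise the weights of the toy two-stream network."""
    a_feat = N_KERNELS * (AUDIO_SHAPE[0] - KSIZE + 1) * (AUDIO_SHAPE[1] - KSIZE + 1)
    v_feat = N_KERNELS * (IMAGE_SHAPE[0] - KSIZE + 1) * (IMAGE_SHAPE[1] - KSIZE + 1)
    return {
        "audio_k": rng.normal(0.0, 0.1, (N_KERNELS, KSIZE, KSIZE)),
        "visual_k": rng.normal(0.0, 0.1, (N_KERNELS, KSIZE, KSIZE)),
        "w_fuse": rng.normal(0.0, 0.1, (a_feat + v_feat, HIDDEN)),
        "w_audio_out": rng.normal(0.0, 0.1, (HIDDEN, AUDIO_SHAPE[0])),
        "w_visual_out": rng.normal(0.0, 0.1, (HIDDEN, IMAGE_SHAPE[0] * IMAGE_SHAPE[1])),
    }


def forward(params, noisy_audio, image):
    """Encode each modality separately, fuse, then decode both outputs."""
    a = conv2d_valid(noisy_audio, params["audio_k"]).ravel()   # audio stream
    v = conv2d_valid(image, params["visual_k"]).ravel()        # visual stream
    fused = np.maximum(np.concatenate([a, v]) @ params["w_fuse"], 0.0)
    enhanced = fused @ params["w_audio_out"]                   # primary output
    recon = (fused @ params["w_visual_out"]).reshape(IMAGE_SHAPE)  # secondary output
    return enhanced, recon


def multitask_loss(enhanced, clean, recon, image, visual_weight=0.5):
    """Audio MSE plus a weighted image-reconstruction MSE (weight is assumed)."""
    audio_mse = np.mean((enhanced - clean) ** 2)
    visual_mse = np.mean((recon - image) ** 2)
    return audio_mse + visual_weight * visual_mse


rng = np.random.default_rng(0)
params = init_params(rng)
noisy = rng.normal(size=AUDIO_SHAPE)       # stand-in for a noisy spectrogram patch
clean = rng.normal(size=AUDIO_SHAPE[0])    # stand-in for the clean target frame
image = rng.uniform(size=IMAGE_SHAPE)      # stand-in for a mouth-region image
enhanced, recon = forward(params, noisy, image)
loss = multitask_loss(enhanced, clean, recon, image)
```

Because both heads read from the same fused representation, gradients from the image-reconstruction term would also update the shared layers during training, which is how a secondary task can influence the primary speech output in a multi-task setup like the one the abstract describes.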
