The purpose of speech enhancement is to extract a target speech signal from a mixture of sounds generated by several sources. Speech enhancement can potentially benefit from visual information about the target speaker, such as lip movement and facial expressions, because the visual aspect of speech is essentially unaffected by the acoustic environment. To fuse audio and visual information, an audio-visual fusion strategy is proposed that goes beyond simple feature concatenation and learns to automatically align the two modalities, yielding a more powerful representation that increases intelligibility in noisy conditions. The proposed model fuses audio-visual features layer by layer and feeds these fused features to each corresponding decoding layer. Experimental results show a relative improvement of 6% to 24% on the test sets over the audio-only modality, depending on the audio noise level. Moreover, PESQ increases significantly from 1.21 to 2.06 in our -15 dB SNR experiment.
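To make the layer-by-layer fusion idea concrete, the following is a minimal sketch, not the authors' actual architecture: at each encoder level, audio and visual features are fused through a learned projection (rather than plain concatenation), and the fused features from each level are passed to the corresponding decoder layer, in the style of U-Net skip connections. All layer sizes, depths, and module names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LayerwiseAVFusion(nn.Module):
    """Illustrative layer-by-layer audio-visual fusion (hypothetical sizes).

    Each encoder level fuses audio and visual features via a learned
    projection; the fused features feed the matching decoder level.
    """
    def __init__(self, dim: int = 64, num_layers: int = 3):
        super().__init__()
        self.audio_enc = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.visual_enc = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        # Learned fusion of the two modalities, beyond simple concatenation
        self.fuse = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(num_layers))
        # Decoder layers each receive the fused features from one encoder level
        self.dec = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(num_layers))

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        fused_per_layer = []
        a, v = audio, visual
        for a_enc, v_enc, fuse in zip(self.audio_enc, self.visual_enc, self.fuse):
            a, v = torch.relu(a_enc(a)), torch.relu(v_enc(v))
            fused_per_layer.append(torch.relu(fuse(torch.cat([a, v], dim=-1))))
        # Decode: each layer concatenates the fused features from the
        # corresponding encoder level (deepest first), like a skip connection.
        x = fused_per_layer[-1]
        for dec, skip in zip(self.dec, reversed(fused_per_layer)):
            x = torch.relu(dec(torch.cat([x, skip], dim=-1)))
        return x

model = LayerwiseAVFusion()
# Dummy inputs: batch of 2, 100 frames, 64-dim audio and visual features
out = model(torch.randn(2, 100, 64), torch.randn(2, 100, 64))
print(tuple(out.shape))  # → (2, 100, 64)
```

In a real enhancement model the linear layers would be convolutional or recurrent blocks operating on spectrogram and lip-region features, but the routing of fused features to matching decoder layers is the point being illustrated.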