Audio-visual speech enhancement is regarded as one of the promising solutions for isolating and enhancing the speech of a desired speaker. Typical methods predict the clean speech spectrum with a naive convolutional-neural-network-based encoder-decoder architecture; such methods a) do not exploit the available data fully and b) cannot effectively balance audio and visual features. The proposed model alleviates these drawbacks by a) fusing audio and visual features layer by layer in the encoding phase and feeding the fused audio-visual features to each corresponding decoder layer, and, more importantly, b) introducing a two-stage multi-head cross attention (MHCA) mechanism into audio-visual speech enhancement to balance the fused audio-visual features and eliminate irrelevant features. This paper proposes an attentional audio-visual multi-layer feature fusion model in which MHCA units are applied to the feature maps at every decoder layer. Experiments demonstrate the superior performance of the proposed network against state-of-the-art models.
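To make the fusion step concrete, the following is a minimal sketch, assuming a PyTorch implementation in which audio encoder features act as queries over visual features in the first attention stage and the fused result re-attends over the audio stream in the second stage; the module name, feature dimensions, and query/key/value arrangement are illustrative assumptions, not the paper's confirmed design.

```python
import torch
import torch.nn as nn

class TwoStageMHCAFusion(nn.Module):
    """Hypothetical sketch of a two-stage multi-head cross-attention (MHCA)
    fusion unit; the query/key/value arrangement is an assumption, not the
    paper's confirmed design."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Stage 1: audio features attend over visual (lip-region) features.
        self.stage1 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 2: the fused features re-attend over the original audio stream,
        # intended to suppress irrelevant (mismatched) components.
        self.stage2 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_a, dim) encoder features at one layer
        # visual: (batch, T_v, dim) visual features at the matching layer
        fused, _ = self.stage1(query=audio, key=visual, value=visual)
        fused = self.norm1(audio + fused)            # residual fusion
        refined, _ = self.stage2(query=fused, key=audio, value=audio)
        return self.norm2(fused + refined)           # fed to the decoder layer


# Example with illustrative shapes: one fusion unit per encoder/decoder layer.
audio_feat = torch.randn(2, 100, 256)   # 2 clips, 100 audio frames, 256 channels
visual_feat = torch.randn(2, 25, 256)   # 25 video frames projected to 256 channels
out = TwoStageMHCAFusion(dim=256)(audio_feat, visual_feat)
print(out.shape)  # torch.Size([2, 100, 256])
```

In this sketch the two stages are stacked with residual connections and layer normalization, so the output keeps the audio time resolution and can be passed to the corresponding decoder layer like an ordinary skip connection.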